Abstract

We obtain a tight distribution-specific characterization of the sample complexity
of large-margin classification with L2 regularization: We introduce the
$\gamma$-adapted-dimension, which is a simple function of the spectrum of a
distribution's covariance matrix, and show distribution-specific upper and lower bounds
on the sample complexity, both governed by the $\gamma$-adapted-dimension of the
source distribution. We conclude that this new quantity tightly characterizes the
true sample complexity of large-margin classification. The bounds hold for a rich
family of sub-Gaussian distributions.

1 Introduction

In this paper we tackle the problem of obtaining a tight characterization of the sample complexity
that a particular learning rule requires in order to learn a particular source distribution.
Specifically, we obtain a tight characterization of the sample complexity required for large (Euclidean)
margin learning to obtain low error for a distribution $D(X, Y)$, for $X \in \mathbb{R}^d$, $Y \in \{\pm 1\}$.

Most learning theory work focuses on upper-bounding the sample complexity. That is, on
providing a bound $m(D, \epsilon)$ and proving that when using some specific learning rule, if the sample
size is at least $m(D, \epsilon)$, an excess error of at most $\epsilon$ (in expectation or with high probability) can
be ensured. For instance, for large-margin classification we know that if $P_D[\|X\| \le B] = 1$,
then $m(D, \epsilon)$ can be set to $O(B^2/(\gamma^2 \epsilon^2))$ to get true error of no more than $\ell^*_\gamma + \epsilon$, where
$\ell^*_\gamma = \min_{\|w\| \le 1} P_D(Y \langle w, X \rangle \le \gamma)$ is the optimal margin error at margin $\gamma$.

Such upper bounds can be useful for understanding positive aspects of a learning rule. But it is 
difficult to understand deficiencies of a learning rule, or to compare between different rules, based 
on upper bounds alone. After all, it is possible, and often the case, that the true sample complexity, 
i.e. the actual number of samples required to get low error, is much lower than the bound. 

Of course, some sample complexity upper bounds are known to be "tight" or to have an almost- 
matching lower bound. This usually means that the bound is tight as a worst-case upper bound for 
a specific class of distributions (e.g. all those with $P_D[\|X\| \le B] = 1$). That is, there exists some
source distribution for which the bound is tight. In other words, the bound concerns some quantity 
of the distribution (e.g. the radius of the support), and is the lowest possible bound in terms of this 
quantity. But this is not to say that for any specific distribution this quantity tightly characterizes the 
sample complexity. For instance, we know that the sample complexity can be much smaller than the 
radius of the support of $X$, if the average norm $\sqrt{E[\|X\|^2]}$ is small. However, $E[\|X\|^2]$ is also not
a precise characterization of the sample complexity, for instance in low dimensions. 

The goal of this paper is to identify a simple quantity determined by the distribution that does
precisely characterize the sample complexity. That is, such that the actual sample complexity for the
learning rule on this specific distribution is governed, up to polylogarithmic factors, by this quantity.
In particular, we present the $\gamma$-adapted-dimension $k_\gamma(D)$. This measure refines both the dimension
and the average norm of $X$, and it can be easily calculated from the covariance matrix of $X$. We show
that for a rich family of "light tailed" distributions (specifically, sub-Gaussian distributions with
independent uncorrelated directions; see Section 2), the number of samples required for learning
by minimizing the $\gamma$-margin-violations is both lower-bounded and upper-bounded by $\tilde{\Theta}(k_\gamma)$. More
precisely, we show that the sample complexity $m(\epsilon, \gamma, D)$ required for achieving excess error of no
more than $\epsilon$ can be bounded from above and from below by:

$$\Omega(k_\gamma(D)) \le m(\epsilon, \gamma, D) \le \tilde{O}\left(\frac{k_\gamma(D)}{\epsilon^2}\right).$$

As can be seen in this bound, we are not concerned with tightly characterizing the dependence of
the sample complexity on the desired error (as done e.g. in [1]), nor with obtaining tight bounds for
very small error levels. In fact, our results can be interpreted as studying the sample complexity
needed to obtain error well below random, but bounded away from zero. This is in contrast to
classical statistical asymptotics, which are also typically tight, but are valid only for very small $\epsilon$. As was
recently shown by Liang and Srebro [2], the quantities on which the sample complexity depends
for very small $\epsilon$ (in the classical statistical asymptotic regime) can be very different from those for
moderate error rates, which are more relevant for machine learning.

Our tight characterization, and in particular the distribution-specific lower bound on the sample
complexity that we establish, can be used to compare large-margin (L2 regularized) learning to other
learning rules. In Section 7 we provide two such examples: we use our lower bound to rigorously
establish a sample complexity gap between L1 and L2 regularization previously studied in [3], and to
show a large gap between discriminative and generative learning on a Gaussian-mixture distribution.

In this paper we focus only on large L2 margin classification. But in order to obtain the distribution- 
specific lower bound, we develop novel tools that we believe can be useful for obtaining lower 
bounds also for other learning rules. 

Related work 

Most work on "sample complexity lower bounds" is directed at proving that under some set of
assumptions, there exists a source distribution for which one needs at least a certain number of
examples to learn with required error and confidence [4, 5, 6]. This type of lower bound does
not, however, indicate much about the sample complexity of other distributions under the same set of
assumptions.

As for distribution-specific lower bounds, the classical analysis of Vapnik [7, Theorem 16.6] pro- 
vides not only sufficient but also necessary conditions for the learnability of a hypothesis class with 
respect to a specific distribution. The essential condition is that the $\epsilon$-entropy of the hypothesis
class with respect to the distribution be sub-linear in the limit of an infinite sample size. In some 
sense, this criterion can be seen as providing a "lower bound" on learnability for a specific distribu- 
tion. However, we are interested in finite-sample convergence rates, and would like those to depend 
on simple properties of the distribution. The asymptotic arguments involved in Vapnik's general 
learnability claim do not lend themselves easily to such analysis. 

Benedek and Itai [8] show that if the distribution is known to the learner, a specific hypothesis
class is learnable if and only if there is a finite $\epsilon$-cover of this hypothesis class with respect to the
distribution. Ben-David et al. [9] consider a similar setting, and prove sample complexity lower
bounds for learning with any data distribution, for some binary hypothesis classes on the real line.
In both of these works, the lower bounds hold for any algorithm, but only for a worst-case target
hypothesis. Vayatis and Azencott [10] provide distribution-specific sample complexity upper bounds
for hypothesis classes with a limited VC-dimension, as a function of how balanced the hypotheses
are with respect to the considered distributions. These bounds are not tight for all distributions, thus
this work also does not provide true distribution-specific sample complexity.

2 Problem setting and definitions 

Let $D$ be a distribution over $\mathbb{R}^d \times \{\pm 1\}$. $D_X$ will denote the restriction of $D$ to $\mathbb{R}^d$. We are
interested in linear separators, parametrized by unit-norm vectors in $B^d_1 \triangleq \{w \in \mathbb{R}^d \mid \|w\|_2 \le 1\}$.
For a predictor $w$ denote its misclassification error with respect to distribution $D$ by
$\ell(w, D) \triangleq P_{(X,Y) \sim D}[Y \langle w, X \rangle \le 0]$. For $\gamma > 0$, denote the $\gamma$-margin loss of $w$ with respect to $D$ by
$\ell_\gamma(w, D) \triangleq P_{(X,Y) \sim D}[Y \langle w, X \rangle \le \gamma]$. The minimal margin loss with respect to $D$ is denoted
by $\ell^*_\gamma(D) \triangleq \min_{w \in B^d_1} \ell_\gamma(w, D)$. For a sample $S = \{(x_i, y_i)\}_{i=1}^m$ such that $(x_i, y_i) \in \mathbb{R}^d \times \{\pm 1\}$,
the margin loss with respect to $S$ is denoted by $\hat{\ell}_\gamma(w, S) \triangleq \frac{1}{m} |\{i \mid y_i \langle x_i, w \rangle \le \gamma\}|$ and the
misclassification error is $\hat{\ell}(w, S) \triangleq \frac{1}{m} |\{i \mid y_i \langle x_i, w \rangle \le 0\}|$. In this paper we are concerned with learning by
minimizing the margin loss. It will be convenient for us to discuss transductive learning algorithms.
Since many predictors minimize the margin loss, we define:

Definition 2.1. A margin-error minimization algorithm $A$ is an algorithm whose input is a
margin $\gamma$, a training sample $S = \{(x_i, y_i)\}_{i=1}^m$ and an unlabeled test sample $\tilde{S}_X = \{\tilde{x}_i\}_{i=1}^m$,
which outputs a predictor $\tilde{w} \in \mathrm{argmin}_{w \in B^d_1} \hat{\ell}_\gamma(w, S)$. We denote the output of the algorithm by
$\tilde{w} = A_\gamma(S, \tilde{S}_X)$.

We will be concerned with the expected test loss of the algorithm given a random training sample and
a random test sample, each of size $m$, and define $\ell_m(A_\gamma, D) \triangleq E_{S, \tilde{S} \sim D^m}[\hat{\ell}(A_\gamma(S, \tilde{S}_X), \tilde{S})]$, where
$S, \tilde{S} \sim D^m$ independently. For $\gamma > 0$, $\epsilon \in [0, 1]$, and a distribution $D$, we denote the
distribution-specific sample complexity by $m(\epsilon, \gamma, D)$: this is the minimal sample size such that for any
margin-error minimization algorithm $A$, and for any $m \ge m(\epsilon, \gamma, D)$, $\ell_m(A_\gamma, D) - \ell^*_\gamma(D) \le \epsilon$.
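
As a concrete illustration of the empirical quantities defined above, the following Python sketch computes the sample margin loss and the sample misclassification error of a fixed predictor. The function names and the toy sample are illustrative assumptions and do not come from the paper.

```python
def inner(u, v):
    """Euclidean inner product <u, v> of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def sample_margin_loss(w, sample, gamma):
    """Fraction of examples (x, y) in `sample` with y * <w, x> <= gamma."""
    violations = sum(1 for x, y in sample if y * inner(w, x) <= gamma)
    return violations / len(sample)


def sample_error(w, sample):
    """Fraction of examples with y * <w, x> <= 0 (the margin loss at gamma = 0)."""
    return sample_margin_loss(w, sample, 0.0)


# Hypothetical toy sample in R^2 with labels in {+1, -1}.
toy_sample = [
    ((2.0, 0.0), 1),
    ((0.5, 1.0), 1),
    ((-3.0, 0.0), -1),
    ((0.25, 0.0), -1),
]
w = (1.0, 0.0)  # a unit-norm predictor

margin_loss_at_1 = sample_margin_loss(w, toy_sample, 1.0)
error = sample_error(w, toy_sample)
```

A margin-error minimization algorithm in the sense of Definition 2.1 would return any unit-ball predictor minimizing `sample_margin_loss` on the training sample; the sketch only evaluates the objective and does not perform that minimization.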

Sub-Gaussian distributions 

We will characterize the distribution-specific sample complexity in terms of the covariance of
$X \sim D_X$. But in order to do so, we must assume that $X$ is not too heavy-tailed. Otherwise, $X$ can
have even infinite covariance but still be learnable, for instance if it has a tiny probability of having
an exponentially large norm. We will thus restrict ourselves to sub-Gaussian distributions. This
ensures light tails in all directions, while allowing a sufficiently rich family of distributions, as we
will presently see. We also require a more restrictive condition, namely that $D_X$ can be rotated to a
product distribution over the axes of $\mathbb{R}^d$. A distribution can always be rotated so that its coordinates
are uncorrelated. Here we further require that they are independent, as of course holds for any
multivariate Gaussian distribution.

Definition 2.2 (See e.g. [11, 12]). A random variable $X$ is sub-Gaussian with moment $B$ (or
$B$-sub-Gaussian) for $B \ge 0$ if

$$\forall t \in \mathbb{R}, \quad E[\exp(tX)] \le \exp(B^2 t^2 / 2). \qquad (1)$$

We further say that $X$ is sub-Gaussian with relative moment $\rho = B / \sqrt{E[X^2]}$.

The sub-Gaussian family is quite extensive: For instance, any bounded, Gaussian, or Gaussian- 
mixture random variable with mean zero is included in this family. 
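
To make condition (1) concrete, here is a small numerical sketch for a symmetric $\pm 1$ variable, whose moment generating function is $\cosh(t)$. It checks the inequality on a finite grid only, so it is a sanity check rather than a proof; the helper names and the grid are assumptions made for illustration.

```python
import math


def rademacher_mgf(t):
    """E[exp(tX)] for X uniform on {-1, +1}, which equals cosh(t)."""
    return 0.5 * (math.exp(t) + math.exp(-t))


def satisfies_subgaussian_bound(mgf, B, ts):
    """Check mgf(t) <= exp(B^2 t^2 / 2) on the finite grid `ts`.

    A small relative slack absorbs floating-point rounding at t = 0.
    """
    return all(mgf(t) <= math.exp(B * B * t * t / 2.0) * (1 + 1e-12) for t in ts)


# Hypothetical grid of t values in [-5, 5].
grid = [i / 10.0 for i in range(-50, 51)]

holds_with_B_1 = satisfies_subgaussian_bound(rademacher_mgf, 1.0, grid)
holds_with_B_half = satisfies_subgaussian_bound(rademacher_mgf, 0.5, grid)
```

On this grid the bound holds with $B = 1$ and fails with $B = 0.5$. Since $E[X^2] = 1$ for this variable, $B = 1$ corresponds to relative moment $\rho = 1$.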

Definition 2.3. A distribution $D_X$ over $X \in \mathbb{R}^d$ is independently sub-Gaussian with relative
moment $\rho$ if there exists some orthonormal basis $a_1, \ldots, a_d \in \mathbb{R}^d$, such that $\langle X, a_i \rangle$ are independent
sub-Gaussian random variables, each with a relative moment $\rho$.

We will focus on the family $\mathcal{D}^{\mathrm{sg}}_\rho$ of all independently $\rho$-sub-Gaussian distributions in arbitrary
dimension, for a small fixed constant $\rho$. For instance, the family $\mathcal{D}^{\mathrm{sg}}_{3/2}$ includes all Gaussian
distributions, all distributions which are uniform over a (hyper)box, and all multi-Bernoulli distributions,
in addition to other less structured distributions. Our upper bounds and lower bounds will be tight
up to quantities which depend on $\rho$, which we will regard as a constant, but the tightness will not
depend on the dimensionality of the space or the variance of the distribution.

3 The $\gamma$-adapted-dimension

As mentioned in the introduction, the sample complexity of margin-error minimization can be
upper-bounded in terms of the average norm $E[\|X\|^2]$ by $m(\epsilon, \gamma, D) \le O(E[\|X\|^2]/(\gamma^2 \epsilon^2))$ [13].
Alternatively, we can rely only on the dimensionality and conclude $m(\epsilon, \gamma, D) \le O(d/\epsilon^2)$. Thus,
although both of these bounds are tight in the worst-case sense, i.e. they are the best bounds that
rely only on the norm or only on the dimensionality respectively, neither is tight in a
distribution-specific sense: If the average norm is unbounded while the dimensionality is small, an arbitrarily
large gap is created between the true $m(\epsilon, \gamma, D)$ and the average-norm upper bound. The converse
happens if the dimensionality is arbitrarily high while the average norm is bounded.

Seeking a distribution-specific tight analysis, one simple option to try to tighten these bounds is to
consider their minimum, $\min(d, E[\|X\|^2]/\gamma^2)/\epsilon^2$, which, trivially, is also an upper bound on the
sample complexity. However, this simple combination is also not tight: Consider a distribution in
which there are a few directions with very high variance, but the combined variance in all other
directions is small. We will show that in such situations the sample complexity is characterized not
by the minimum of dimension and norm, but by the sum of the number of high-variance dimensions
and the average norm in the other directions. This behavior is captured by the $\gamma$-adapted-dimension:

Definition 3.1. Let $b > 0$ and $k$ a positive integer.

(a) A subset $\mathcal{X} \subseteq \mathbb{R}^d$ is $(b, k)$-limited if there exists a sub-space $V \subseteq \mathbb{R}^d$ of dimension $d - k$
such that $\mathcal{X} \subseteq \{x \in \mathbb{R}^d \mid \|x'P\|^2 \le b\}$, where $P$ is an orthogonal projection onto $V$.

(b) A distribution $D_X$ over $\mathbb{R}^d$ is $(b, k)$-limited if there exists a sub-space $V \subseteq \mathbb{R}^d$ of
dimension $d - k$ such that $E_{X \sim D_X}[\|X'P\|^2] \le b$, with $P$ an orthogonal projection onto $V$.

Definition 3.2. The $\gamma$-adapted-dimension of a distribution or a set, denoted by $k_\gamma$, is the minimum
$k$ such that the distribution or set is $(\gamma^2 k, k)$-limited.

It is easy to see that $k_\gamma(D_X)$ is upper-bounded by $\min(d, E[\|X\|^2]/\gamma^2)$. Moreover, it can be much
smaller. For example, for $X \in \mathbb{R}^{1001}$ with independent coordinates such that the variance of the
first coordinate is 1000, but the variance in each remaining coordinate is 0.001, we have $k_1 = 1$ but
$d = E[\|X\|^2] = 1001$. More generally, if $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ are the eigenvalues of the covariance
matrix of $X$, then $k_\gamma = \min\{k \mid \sum_{i=k+1}^{d} \lambda_i \le \gamma^2 k\}$. A quantity similar to $k_\gamma$ was studied
previously in [14]. $k_\gamma$ is different in nature from some other quantities used for providing sample
complexity bounds in terms of eigenvalues, as in [15], since it is defined based on the eigenvalues
of the distribution and not of the sample. In Section 6 we will see that these can be quite different.
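
The eigenvalue formula for the $\gamma$-adapted-dimension can be evaluated directly. The following is a minimal Python sketch, assuming the covariance eigenvalues are already available as a list; the function name is hypothetical, and the search starts at $k = 1$ because Definition 3.1 takes $k$ to be a positive integer. Exact rationals are used for the 1001-dimensional example so that the comparison at the boundary is not affected by floating-point rounding.

```python
from fractions import Fraction


def gamma_adapted_dimension(eigenvalues, gamma):
    """Smallest k >= 1 such that the eigenvalues beyond the top k sum to at most gamma^2 * k."""
    lam = sorted(eigenvalues, reverse=True)
    tail = sum(lam)  # sum of all eigenvalues, i.e. the tail for k = 0
    for k in range(1, len(lam) + 1):
        tail -= lam[k - 1]  # tail is now the sum of the eigenvalues after the top k
        if tail <= gamma ** 2 * k:
            return k
    return len(lam)


# The example from the text: one coordinate with variance 1000, and 1000 coordinates
# with variance 0.001 each (written as exact fractions).
paper_spectrum = [Fraction(1000)] + [Fraction(1, 1000)] * 1000
k_1 = gamma_adapted_dimension(paper_spectrum, 1)

# A spherical spectrum with unit variances in d = 100 dimensions, at margin 1.
k_spherical = gamma_adapted_dimension([1] * 100, 1)
```

For the example spectrum the sketch returns 1, matching the value of $k_1$ stated above, while the trace and the dimension are both 1001. For the spherical spectrum it returns 50, since the tail $100 - k$ first drops to at most $k$ at $k = 50$.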

In order to relate our upper and lower bounds, it will be useful to relate the $\gamma$-adapted-dimension for
different margins. The relationship is established in the following lemma, proved in the appendix:

Lemma 3.3. For $0 < \alpha < 1$, $\gamma > 0$ and a distribution $D_X$, $k_\gamma(D_X) \le k_{\alpha\gamma}(D_X) \le \frac{2 k_\gamma(D_X)}{\alpha^2} + 1$.

We proceed to provide a sample complexity upper bound based on the $\gamma$-adapted-dimension.

4 A sample complexity upper bound using $\gamma$-adapted-dimension

In order to establish an upper bound on the sample complexity, we will bound the fat-shattering
dimension of the linear functions over a set in terms of the $\gamma$-adapted-dimension of the set. Recall
that the fat-shattering dimension is a classic quantity for proving sample complexity upper bounds:

Definition 4.1. Let $\mathcal{F}$ be a set of functions $f : \mathcal{X} \to \mathbb{R}$, and let $\gamma > 0$. The set $\{x_1, \ldots, x_m\} \subseteq \mathcal{X}$ is
$\gamma$-shattered by $\mathcal{F}$ if there exist $r_1, \ldots, r_m \in \mathbb{R}$ such that for all $y \in \{\pm 1\}^m$ there is an $f \in \mathcal{F}$ such
that $\forall i \in [m]$, $y_i(f(x_i) - r_i) \ge \gamma$. The $\gamma$-fat-shattering dimension of $\mathcal{F}$ is the size of the largest
set in $\mathcal{X}$ that is $\gamma$-shattered by $\mathcal{F}$.

The sample complexity of $\gamma$-loss minimization is bounded by $\tilde{O}(d_{\gamma/8}/\epsilon^2)$, where $d_{\gamma/8}$ is the
$\gamma/8$-fat-shattering dimension of the function class [16, Theorem 13.4]. Let $\mathcal{W}(\mathcal{X})$ be the class of linear
functions restricted to the domain $\mathcal{X}$. For any set we show:

Theorem 4.2. If a set $\mathcal{X}$ is $(B^2, k)$-limited, then the $\gamma$-fat-shattering dimension of $\mathcal{W}(\mathcal{X})$ is at most
$\frac{3}{2}(B^2/\gamma^2 + k + 1)$. Consequently, it is also at most $3 k_\gamma(\mathcal{X}) + 1$.

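Proof. Let $X$ be an $m \times d$ matrix whose rows are a set of $m$ points in $\mathbb{R}^d$ which is $\gamma$-shattered.
For any $\epsilon > 0$ we can augment $X$ with an additional column to form the matrix $\tilde{X}$ of dimensions
$m \times (d+1)$, such that for all $y \in \{-\gamma, +\gamma\}^m$, there is a $w_y \in B^{d+1}_{1+\epsilon}$ (the ball of radius $1+\epsilon$ in $\mathbb{R}^{d+1}$)
such that $\tilde{X} w_y = y$ (the details can be found in the appendix). Since $\mathcal{X}$ is $(B^2, k)$-limited, there is an
orthogonal projection matrix $\tilde{P}$ of size $(d+1) \times (d+1)$ such that $\forall i \in [m]$, $\|\tilde{X}_i' \tilde{P}\|^2 \le B^2$, where $\tilde{X}_i$ is the
vector in row $i$ of $\tilde{X}$. Let $V$ be the sub-space of dimension $d - k$ spanned by the columns of $\tilde{P}$. To bound the size of
the shattered set, we show that the projected rows of $\tilde{X}$ on $V$ are 'shattered' using projected labels.
We then proceed similarly to the proof of the norm-only fat-shattering bound [17].

We have $\tilde{X} = \tilde{X}\tilde{P} + \tilde{X}(I - \tilde{P})$. In addition, $\tilde{X} w_y = y$. Thus $y - \tilde{X}\tilde{P} w_y = \tilde{X}(I - \tilde{P}) w_y$.
$I - \tilde{P}$ is a projection onto a $(k+1)$-dimensional space, thus the rank of $\tilde{X}(I - \tilde{P})$ is at most $k + 1$.
Let $T$ be an $m \times m$ orthogonal projection matrix onto the subspace orthogonal to the columns
of $\tilde{X}(I - \tilde{P})$. This sub-space is of dimension at least $l = m - (k + 1)$, thus $\mathrm{trace}(T) \ge l$.
$T(y - \tilde{X}\tilde{P} w_y) = T\tilde{X}(I - \tilde{P}) w_y = 0_{m \times 1}$. Thus $Ty = T\tilde{X}\tilde{P} w_y$ for every $y \in \{-\gamma, +\gamma\}^m$.

Denote row $i$ of $T$ by $t_i$ and row $i$ of $T\tilde{X}\tilde{P}$ by $z_i$. We have $\forall i \le m$, $\langle z_i, w_y \rangle = t_i y = \sum_{j \le m} t_i[j] y[j]$.
Therefore $\langle \sum_i z_i y[i], w_y \rangle = \sum_{i \le m} \sum_{j \le m} t_i[j] y[i] y[j]$. Since $\|w_y\| \le 1 + \epsilon$,
$\forall x \in \mathbb{R}^{d+1}$, $(1 + \epsilon)\|x\| \ge \|x\| \|w_y\| \ge \langle x, w_y \rangle$. Thus $\forall y \in \{-\gamma, +\gamma\}^m$,
$(1 + \epsilon)\|\sum_i z_i y[i]\| \ge \sum_{i \le m} \sum_{j \le m} t_i[j] y[i] y[j]$. Taking the expectation of $y$ chosen uniformly at random, we have

$$(1 + \epsilon) E\Big[\Big\|\sum_i z_i y[i]\Big\|\Big] \ge \sum_{i,j} E\big[t_i[j] y[i] y[j]\big] = \gamma^2 \sum_i t_i[i] = \gamma^2 \, \mathrm{trace}(T) \ge \gamma^2 l.$$

In addition, $\frac{1}{\gamma^2} E[\|\sum_i z_i y[i]\|^2] = \sum_{i=1}^m \|z_i\|^2 = \mathrm{trace}(\tilde{P}'\tilde{X}'T^2\tilde{X}\tilde{P}) \le \mathrm{trace}(\tilde{P}'\tilde{X}'\tilde{X}\tilde{P}) \le B^2 m$.
From the inequality $E[Z]^2 \le E[Z^2]$, which holds for any random variable $Z$, it follows that $l^2 \le (1 + \epsilon)^2 \frac{B^2}{\gamma^2} m$
(when $l < 0$ we have $m \le k$ and the claimed bound is immediate). Since this holds for any
$\epsilon > 0$, we can set $\epsilon = 0$ and solve for $m$. Thus

$$m \le (k + 1) + \frac{B^2}{2\gamma^2} + \sqrt{\frac{B^4}{4\gamma^4} + \frac{B^2}{\gamma^2}(k + 1)} \le (k + 1) + \frac{B^2}{\gamma^2} + \sqrt{\frac{B^2}{\gamma^2}(k + 1)} \le \frac{3}{2}\Big(\frac{B^2}{\gamma^2} + k + 1\Big). \qquad \Box$$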
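
The last step of the proof solves a quadratic inequality in $m$ and then relaxes the root to a simpler expression. The Python sketch below, with hypothetical function names and parameter values chosen only for illustration, computes the exact root of $(m - (k+1))^2 = (B^2/\gamma^2)\, m$ and compares it with the simplified bound $\frac{3}{2}(B^2/\gamma^2 + k + 1)$.

```python
import math


def quadratic_root_bound(B2, gamma, k):
    """Largest real m satisfying (m - (k + 1))^2 <= (B2 / gamma^2) * m."""
    a = B2 / gamma ** 2
    c = k + 1
    return c + a / 2.0 + math.sqrt(a * a / 4.0 + a * c)


def simplified_bound(B2, gamma, k):
    """The relaxed closed form (3/2) * (B2 / gamma^2 + k + 1)."""
    return 1.5 * (B2 / gamma ** 2 + k + 1)


# Hypothetical parameter settings (B^2, gamma, k).
settings = [(4.0, 1.0, 3), (100.0, 2.0, 0), (0.5, 1.0, 10), (9.0, 3.0, 1)]
roots = [quadratic_root_bound(*s) for s in settings]
relaxed = [simplified_bound(*s) for s in settings]
```

In each setting the exact root is no larger than the relaxed bound, which is the inequality chain used at the end of the proof (it follows from $\sqrt{ac} \le (a + c)/2$).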

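Corollary 4.3. Let $D$ be a distribution over $\mathcal{X} \times \{\pm 1\}$, $\mathcal{X} \subseteq \mathbb{R}^d$. Then

$$m(\epsilon, \gamma, D) \le \tilde{O}\left(\frac{k_\gamma(\mathcal{X})}{\epsilon^2}\right).$$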

The corollary above holds only for distributions with bounded support. However, since sub-Gaussian 
variables have an exponentially decaying tail, we can use this corollary to provide a bound for 
independently sub-Gaussian distributions as well (see appendix for proof): 

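Theorem 4.4 (Upper Bound for Distributions in $\mathcal{D}^{\mathrm{sg}}_\rho$). For any distribution $D$ over $\mathbb{R}^d \times \{\pm 1\}$ such
that $D_X \in \mathcal{D}^{\mathrm{sg}}_\rho$,

$$m(\epsilon, \gamma, D) = \tilde{O}\left(\frac{k_\gamma(D_X)}{\epsilon^2}\right),$$

where $\rho$ is regarded as a constant.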

This new upper bound is tighter than norm-only and dimension-only upper bounds. But does the
$\gamma$-adapted-dimension characterize the true sample complexity of the distribution, or is it just another
upper bound? To answer this question, we need to be able to derive sample complexity lower bounds
as well. We consider this problem in the following section.

5 Sample complexity lower bounds using Gram-matrix eigenvalues 

We wish to find a distribution-specific lower bound that depends on the $\gamma$-adapted-dimension, and
matches our upper bound as closely as possible. To do that, we will link the ability to learn with
a margin, with properties of the data distribution. The ability to learn is closely related to the
probability of a sample to be shattered, as evident from Vapnik's formulations of learnability as a
function of the $\epsilon$-entropy. In the preceding section we used the fact that non-shattering (as captured
by the fat-shattering dimension) implies learnability. For the lower bound we use the converse fact,
presented below in Theorem 5.1: If a sample can be fat-shattered with a reasonably high probability,
then learning is impossible. We then relate the fat-shattering of a sample to the minimal eigenvalue
of its Gram matrix. This allows us to present a lower bound on the sample complexity using a lower
bound on the smallest eigenvalue of the Gram matrix of a sample drawn from the data distribution.
We use the term '$\gamma$-shattered at the origin' to indicate that a set is $\gamma$-shattered by setting the bias
$r \in \mathbb{R}^m$ (see Def. 4.1) to the zero vector.

Theorem 5.1. Let $D$ be a distribution over $\mathbb{R}^d \times \{\pm 1\}$. If the probability of a sample of size $m$
drawn from $D_X^m$ to be $\gamma$-shattered at the origin is at least $\eta$, then there is a margin-error minimization
algorithm $A$, such that $\ell_{m/2}(A_\gamma, D) \ge \eta/2$.

Proof. For a given distribution $D$, let $A$ be an algorithm which, for every two input samples $S$ and
$\tilde{S}_X$, labels $\tilde{S}_X$ using the separator $w \in \mathrm{argmin}_{w \in B^d_1} \hat{\ell}_\gamma(w, S)$ that maximizes $E_{\tilde{S}_Y}[\hat{\ell}_\gamma(w, \tilde{S})]$, where the
expectation is over the test labels $\tilde{S}_Y$ drawn according to $D$ conditioned on $\tilde{S}_X$.
For every $x \in \mathbb{R}^d$ there is a label $y \in \{\pm 1\}$ such that $P_{(X,Y) \sim D}[Y \ne y \mid X = x] \ge \frac{1}{2}$. If the set of
examples in $S_X$ and $\tilde{S}_X$ together is $\gamma$-shattered at the origin, then $A$ chooses a separator with zero
margin loss on $S$, but loss of at least $\frac{1}{2}$ on $\tilde{S}$. Therefore $\ell_{m/2}(A_\gamma, D) \ge \eta/2$. $\Box$

The notion of shattering involves checking the existence of a unit-norm separator $w$ for each
label-vector $y \in \{\pm 1\}^m$. In general, there is no closed form for the minimum-norm separator. However,
the following theorem provides an equivalent and simple characterization for fat-shattering:

Theorem 5.2. Let $S = (X_1, \ldots, X_m)$ be a sample in $\mathbb{R}^d$, and denote by $X$ the $m \times d$ matrix whose rows are
the elements of $S$. Then $S$ is 1-shattered at the origin iff $XX'$ is invertible and $\forall y \in \{\pm 1\}^m$, $y'(XX')^{-1}y \le 1$.

The proof of this theorem is in the appendix. The main issue in the proof is showing that if a set is
shattered, it is also shattered with exact margins, since the set of exact margins $\{\pm 1\}^m$ lies in the
convex hull of any set of non-exact margins that correspond to all the possible labelings. We can now
use the minimum eigenvalue of the Gram matrix to obtain a sufficient condition for fat-shattering,
after which we present the theorem linking eigenvalues and learnability. For a matrix $X$, $\lambda_n(X)$
denotes the $n$'th largest eigenvalue of $X$.

Lemma 5.3. Let $S = (X_1, \ldots, X_m)$ be a sample in $\mathbb{R}^d$, with $X$ as above. If $\lambda_m(XX') \ge m$ then
$S$ is 1-shattered at the origin.

Proof. If $\lambda_m(XX') \ge m$ then $XX'$ is invertible and $\lambda_1((XX')^{-1}) \le 1/m$. For any $y \in \{\pm 1\}^m$
we have $\|y\| = \sqrt{m}$ and $y'(XX')^{-1}y \le \|y\|^2 \lambda_1((XX')^{-1}) \le m(1/m) = 1$. By Theorem 5.2 the
sample is 1-shattered at the origin. $\Box$
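
The quadratic form in this argument is the squared norm of the minimum-norm separator that attains exact margins. A short worked derivation of this standard fact, assuming $XX'$ is invertible (the symbol $w_y$ is introduced here only for illustration):

```latex
\begin{align*}
w_y &= X'(XX')^{-1}y
  && \text{(minimum-norm solution of } Xw = y\text{)} \\
\|w_y\|^2 &= y'(XX')^{-1}\,XX'\,(XX')^{-1}y = y'(XX')^{-1}y \\
  &\le \|y\|^2\,\lambda_1\big((XX')^{-1}\big)
   = \frac{m}{\lambda_m(XX')} \le 1
  && \text{when } \lambda_m(XX') \ge m .
\end{align*}
```

Since $Xw_y = y$ gives $y_i \langle X_i, w_y \rangle = y_i^2 = 1$ for every $i$, each labeling $y \in \{\pm 1\}^m$ is realized with margin 1 by a predictor of norm at most 1, with no bias term.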

Theorem 5.4. Let $D$ be a distribution over $\mathbb{R}^d \times \{\pm 1\}$, $S$ be an i.i.d. sample of size $m$ drawn from $D$,
and denote by $X_S$ the $m \times d$ matrix whose rows are the points from $S$. If $P[\lambda_m(X_S X_S') \ge m\gamma^2] \ge \eta$,
then there exists a margin-error minimization algorithm $A$ such that $\ell_{m/2}(A_\gamma, D) \ge \eta/2$.



Theorem 5.4 follows by scaling $X_S$ by $\gamma$, applying Lemma 5.3 to establish $\gamma$-fat shattering with
probability at least $\eta$, then applying Theorem 5.1. Lemma 5.3 generalizes the requirement for linear
independence when shattering using hyperplanes with no margin (i.e. no regularization). For
unregularized (homogeneous) linear separation, a sample is shattered iff it is linearly independent, i.e. if
$\lambda_m > 0$. Requiring $\lambda_m \ge m\gamma^2$ is enough for $\gamma$-fat-shattering. Theorem 5.4 then generalizes the
simple observation that if samples of size $m$ are linearly independent with high probability, there
is no hope of generalizing from $m/2$ points to the other $m/2$ using unregularized linear predictors.
Theorem 5.4 can thus be used to derive a distribution-specific lower bound. Define:



$$m_\gamma(D) \triangleq \frac{1}{2} \min\Big\{ m \;\Big|\; P_{S \sim D^m}\big[\lambda_m(X_S X_S') \ge m\gamma^2\big] < \frac{1}{2} \Big\}.$$

Then for any $\epsilon < 1/4 - \ell^*_\gamma(D)$, we can conclude that $m(\epsilon, \gamma, D) \ge m_\gamma(D)$, that is, we cannot learn
within reasonable error with fewer than $m_\gamma$ examples. Recall that our upper bound on the sample
complexity from Section 4 was $\tilde{O}(k_\gamma)$. The remaining question is whether we can relate $m_\gamma$ and
$k_\gamma$, to establish that our lower bound and upper bound tightly specify the sample complexity.



6 A lower bound for independently sub-Gaussian distributions

As discussed in the previous section, to obtain a sample complexity lower bound we require a bound
on the value of the smallest eigenvalue of a random Gram matrix. The distribution of this eigenvalue
has been investigated under various assumptions. The cleanest results are in the case where $m, d \to \infty$
and $\frac{m}{d} \to \beta < 1$, and the coordinates of each example are identically distributed:

Theorem 6.1 (Theorem 5.11 in [18]). Let $X_i$ be a series of $m_i \times d_i$ matrices whose entries are i.i.d.
random variables with mean zero, variance $\sigma^2$ and finite fourth moments. If $\lim_{i \to \infty} \frac{m_i}{d_i} = \beta < 1$,
then $\lim_{i \to \infty} \lambda_{m_i}(\frac{1}{d_i} X_i X_i') = \sigma^2 (1 - \sqrt{\beta})^2$.

This asymptotic limit can be used to calculate $m_\gamma$ and thus provide a lower bound on the sample
complexity: Let the coordinates of $X \in \mathbb{R}^d$ be i.i.d. with variance $\sigma^2$ and consider a sample of size
$m$. If $d, m$ are large enough, we have by Theorem 6.1:

$$\lambda_m(XX') \approx d\sigma^2 (1 - \sqrt{m/d})^2 = \sigma^2 (\sqrt{d} - \sqrt{m})^2.$$

Solving $\sigma^2 (\sqrt{d} - \sqrt{2 m_\gamma})^2 = 2 m_\gamma \gamma^2$ we get $m_\gamma \approx \frac{1}{2} d / (1 + \gamma/\sigma)^2$. We can also calculate the
$\gamma$-adapted-dimension for this distribution to get $k_\gamma \approx d / (1 + \gamma^2/\sigma^2)$, and conclude that
$\frac{1}{4} k_\gamma \le m_\gamma \le \frac{1}{2} k_\gamma$. In this case, then, we are indeed able to relate the sample complexity lower bound with $k_\gamma$, the
same quantity that controls our upper bound. This conclusion is easy to derive from known results;
however, it holds only asymptotically, and only for a highly limited set of distributions. Moreover,
since Theorem 6.1 holds asymptotically for each distribution separately, we cannot deduce from it
any finite-sample lower bounds for families of distributions.
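
The relation between the two asymptotic expressions can be checked numerically. The sketch below evaluates the approximations $m_\gamma \approx \frac{1}{2} d/(1 + \gamma/\sigma)^2$ and $k_\gamma \approx d/(1 + \gamma^2/\sigma^2)$ from the paragraph above for a few margin-to-scale ratios; the function names and the chosen values are assumptions for illustration, and the formulas are the large-$d$ approximations, not finite-sample quantities.

```python
def asymptotic_m_gamma(d, gamma, sigma):
    """Approximation m_gamma ~ d / (2 * (1 + gamma/sigma)^2) for i.i.d. coordinates."""
    return d / (2.0 * (1.0 + gamma / sigma) ** 2)


def asymptotic_k_gamma(d, gamma, sigma):
    """Approximation k_gamma ~ d / (1 + gamma^2 / sigma^2) for i.i.d. coordinates."""
    return d / (1.0 + (gamma / sigma) ** 2)


def ratio(d, gamma, sigma):
    """m_gamma / k_gamma under the two approximations; depends only on gamma / sigma."""
    return asymptotic_m_gamma(d, gamma, sigma) / asymptotic_k_gamma(d, gamma, sigma)


# Hypothetical margins for sigma = 1 and d = 1000.
margins = [0.01, 0.1, 0.5, 1.0, 2.0, 10.0, 100.0]
ratios = [ratio(1000.0, g, 1.0) for g in margins]
```

Writing $t = \gamma/\sigma$, the ratio equals $(1 + t^2)/(2(1 + t)^2)$, which is smallest at $t = 1$, where it equals $1/4$, and approaches $1/2$ as $t \to 0$ or $t \to \infty$.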

For our analysis we require finite-sample bounds for the smallest eigenvalue of a random Gram
matrix. Rudelson and Vershynin [19, 20] provide such finite-sample lower bounds for distributions
with identically distributed sub-Gaussian coordinates. In the following theorem we generalize
results of Rudelson and Vershynin to encompass also non-identically distributed coordinates. The
proof of Theorem 6.2 can be found in the appendix. Based on this theorem we conclude with
Theorem 6.3, stated below, which constitutes our final sample complexity lower bound.

Theorem 6.2. Let $\rho > 0$. There is a constant $\beta > 0$ which depends only on $\rho$, such that for any
$\delta \in (0, 1)$ there exists a number $L_0$, such that for any independently sub-Gaussian distribution with
covariance matrix $\Sigma \le I$ and $\mathrm{trace}(\Sigma) \ge L_0$, if each of its independent sub-Gaussian coordinates
has relative moment $\rho$, then for any $m \le \beta \cdot \mathrm{trace}(\Sigma)$,

$$P[\lambda_m(X_m X_m') \ge m] \ge 1 - \delta,$$

where $X_m$ is an $m \times d$ matrix whose rows are independent draws from $D_X$.

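Theorem 6.3 (Lower bound for distributions in $\mathcal{D}^{\mathrm{sg}}_\rho$). For any $\rho > 0$, there are a constant $\beta > 0$
and an integer $L_0$ such that for any $D$ such that $D_X \in \mathcal{D}^{\mathrm{sg}}_\rho$ and $k_\gamma(D_X) > L_0$, for any margin
$\gamma > 0$ and any $\epsilon < \frac{1}{4} - \ell^*_\gamma(D)$,

$$m(\epsilon, \gamma, D) \ge \beta k_\gamma(D_X).$$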

Proof. The covariance matrix of $D_X$ is clearly diagonal. We assume w.l.o.g. that
$\Sigma = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ where $\lambda_1 \ge \ldots \ge \lambda_d > 0$. Let $S$ be an i.i.d. sample of size $m$ drawn from
$D$. Let $X$ be the $m \times d$ matrix whose rows are the unlabeled examples from $S$. Let $\delta$ be fixed, and
set $\beta$ and $L_0$ as defined in Theorem 6.2 for $\delta$. Assume $m \le \beta(k_\gamma - 1)$.

We would like to use Theorem 6.2 to bound the smallest eigenvalue of $XX'$ with high probability,
so that we can then apply Theorem 5.4 to get the desired lower bound. However, Theorem 6.2 holds
only if all the coordinate variances are bounded by 1. Thus we divide the problem into two cases,
based on the value of $\lambda_{k_\gamma + 1}$, and apply Theorem 6.2 separately to each case.

Case I: Assume $\lambda_{k_\gamma + 1} \ge \gamma^2$. Then $\forall i \in [k_\gamma]$, $\lambda_i \ge \gamma^2$. Let $\Sigma_1 = \mathrm{diag}(1/\lambda_1, \ldots, 1/\lambda_{k_\gamma}, 0, \ldots, 0)$.
The random matrix $X\sqrt{\Sigma_1}$ is drawn from an independently sub-Gaussian distribution, such that
each of its coordinates has sub-Gaussian relative moment $\rho$ and covariance matrix $\Sigma \cdot \Sigma_1 \le I_d$. In
addition, $\mathrm{trace}(\Sigma \cdot \Sigma_1) = k_\gamma \ge L_0$. Therefore Theorem 6.2 holds for $X\sqrt{\Sigma_1}$, and
$P[\lambda_m(X \Sigma_1 X') \ge m] \ge 1 - \delta$. Clearly, for any $X$, $\lambda_m(\frac{1}{\gamma^2} XX') \ge \lambda_m(X \Sigma_1 X')$. Thus
$P[\lambda_m(\frac{1}{\gamma^2} XX') \ge m] \ge 1 - \delta$.

Case II: Assume $\lambda_{k_\gamma + 1} < \gamma^2$. Then $\lambda_i < \gamma^2$ for all $i \in \{k_\gamma + 1, \ldots, d\}$. Let
$\Sigma_2 = \mathrm{diag}(0, \ldots, 0, 1/\gamma^2, \ldots, 1/\gamma^2)$, with $k_\gamma$ zeros on the diagonal. Then the random matrix $X\sqrt{\Sigma_2}$
is drawn from an independently sub-Gaussian distribution with covariance matrix $\Sigma \cdot \Sigma_2 \le I_d$, such
that all its coordinates have sub-Gaussian relative moment $\rho$. In addition, from the properties of $k_\gamma$
(see discussion in Section 3), $\mathrm{trace}(\Sigma \cdot \Sigma_2) = \frac{1}{\gamma^2} \sum_{i = k_\gamma + 1}^{d} \lambda_i \ge k_\gamma - 1 \ge L_0 - 1$. Thus Theorem 6.2
holds for $X\sqrt{\Sigma_2}$, and so $P[\lambda_m(\frac{1}{\gamma^2} XX') \ge m] \ge P[\lambda_m(X \Sigma_2 X') \ge m] \ge 1 - \delta$.

In both cases $P[\lambda_m(\frac{1}{\gamma^2} XX') \ge m] \ge 1 - \delta$ for any $m \le \beta(k_\gamma - 1)$. By Theorem 5.4 there exists
an algorithm $A$ such that for any $m \le \beta(k_\gamma - 1) - 1$, $\ell_m(A_\gamma, D) \ge \frac{1}{2} - \delta/2$. Therefore, for any
$\epsilon < \frac{1}{2} - \delta/2 - \ell^*_\gamma(D)$, we have $m(\epsilon, \gamma, D) \ge \beta(k_\gamma - 1)$. We get the theorem by setting $\delta = \frac{1}{2}$. $\Box$

7 Summary and consequences 

Theorem 4.4 and Theorem 6.3 provide an upper bound and a lower bound for the sample complexity
of any distribution $D$ whose data distribution is in $\mathcal{D}^{\mathrm{sg}}_\rho$ for some fixed $\rho > 0$. We can thus draw the
following bound, which holds for any $\gamma > 0$ and $\epsilon \in (0, \frac{1}{4} - \ell^*_\gamma(D))$:

$$\Omega(k_\gamma(D_X)) \le m(\epsilon, \gamma, D) \le \tilde{O}\left(\frac{k_\gamma(D_X)}{\epsilon^2}\right). \qquad (2)$$

In both sides of the bound, the hidden constants depend only on the constant $\rho$. This result shows
that the true sample complexity of learning each of these distributions is characterized by the
$\gamma$-adapted-dimension. An interesting conclusion can be drawn as to the influence of the conditional
distribution of labels $D_{Y|X}$: Since Eq. (2) holds for any $D_{Y|X}$, the effect of the direction of the best
separator on the sample complexity is bounded, even for highly non-spherical distributions. We can
use Eq. (2) to easily characterize the sample complexity behavior for interesting distributions, and
to compare L2 margin minimization to other learning methods.

Gaps between L1 and L2 regularization in the presence of irrelevant features. Ng [3] considers
learning a single relevant feature in the presence of many irrelevant features, and compares using
L1 regularization and L2 regularization. When $\|X\|_\infty \le 1$, upper bounds on learning with L1
regularization guarantee a sample complexity of $O(\log(d))$ for an L1-based learning rule [21]. In
order to compare this with the sample complexity of L2 regularized learning and establish a gap,
one must use a lower bound on the L2 sample complexity. The argument provided by Ng actually
assumes scale-invariance of the learning rule, and is therefore valid only for unregularized linear
learning. However, using our results we can easily establish a lower bound of $\Omega(d)$ for many specific
distributions with $\|X\|_\infty \le 1$ and $Y = X[1] \in \{\pm 1\}$. For instance, when each coordinate is an
independent Bernoulli variable, the distribution is sub-Gaussian with $\rho = 1$, and $k_1 = \lceil d/2 \rceil$.

Gaps between generative and discriminative learning for a Gaussian mixture. Consider two
classes, each drawn from a unit-variance spherical Gaussian in a high dimension $\mathbb{R}^d$ and with a
large distance $2v \gg 1$ between the class means, such that $d \gg v^4$. Then $P_D[X \mid Y = y] =
\mathcal{N}(yv \cdot e_1, I_d)$, where $e_1$ is a unit vector in $\mathbb{R}^d$. For any $v$ and $d$, we have $D_X \in \mathcal{D}^{sg}_1$. For large
values of $v$, we have extremely low margin error at $\gamma = v/2$, and so we can hope to learn the
classes by looking for a large-margin separator. Indeed, we can calculate $k_\gamma = \lceil d/(1 + v^2/4) \rceil$, and
conclude that the sample complexity required is $\Omega(d/v^2)$. Now consider a generative approach:
fitting a spherical Gaussian model for each class. This amounts to estimating each class center as
the empirical average of the points in the class, and classifying based on the nearest estimated class
center. It is possible to show that for any constant $\epsilon > 0$, and for large enough $v$ and $d$, $O(d/v^4)$
samples are enough in order to ensure an error of $\epsilon$. This establishes a rather large gap of $\Omega(v^2)$
between the sample complexity of the discriminative approach and that of the generative one.
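
The closed form for $k_\gamma$ in the mixture example can be reproduced numerically. The sketch below is illustrative only: it assumes the spectrum of $E[XX']$ for the mixture is $(1 + v^2, 1, \ldots, 1)$, which follows from the class means $\pm v \cdot e_1$ and identity covariance within each class, and the helper names are hypothetical.

```python
import math


def adapted_dimension(eigenvalues, gamma):
    """Smallest k with sum_{i>k} lambda_i <= gamma^2 * k (eigenvalues sorted descending)."""
    lam = sorted(eigenvalues, reverse=True)
    tail = sum(lam)
    for k in range(len(lam) + 1):
        if tail <= gamma ** 2 * k:
            return k
        if k < len(lam):
            tail -= lam[k]
    return len(lam)


def mixture_spectrum(d, v):
    """Assumed spectrum of E[XX'] for the two-Gaussian mixture: the mean
    direction e_1 carries 1 + v^2, every other direction carries 1."""
    return [1.0 + v * v] + [1.0] * (d - 1)


d, v = 1000, 4.0
k_numeric = adapted_dimension(mixture_spectrum(d, v), gamma=v / 2)
k_closed = math.ceil(d / (1 + v * v / 4))
print(k_numeric, k_closed)
```

For these assumed values the two quantities coincide, and both scale as $d/v^2$ once $v$ is large.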

To summarize, we have shown that the true sample complexity of large-margin learning of a rich
family of specific distributions is characterized by the $\gamma$-adapted-dimension. This result allows true
comparison between this learning algorithm and other algorithms, and has various applications, such
as semi-supervised learning and feature construction. The challenge of characterizing true sample
complexity extends to any distribution and any learning algorithm. We believe that obtaining
answers to these questions is of great importance, both to learning theory and to learning applications.

Acknowledgments 

The authors thank Boaz Nadler for many insightful discussions, and Karthik Sridharan for pointing 
out [14] to us. Sivan Sabato is supported by the Adams Fellowship Program of the Israel Academy
of Sciences and Humanities. This work was supported by the NATO SfP grant 982480. 






References 

[1] I. Steinwart and C. Scovel. Fast rates for support vector machines using Gaussian kernels. Annals of
Statistics, 35(2):575-607, 2007.

[2] P. Liang and N. Srebro. On the interaction between norm and dimensionality: Multiple regimes in
learning. In ICML, 2010.

[3] A.Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In ICML, 2004.

[4] A. Antos and G. Lugosi. Strong minimax lower bounds for learning. Mach. Learn., 30(1):31-56, 1998.

[5] A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number of
examples needed for learning. In Proceedings of the First Annual Workshop on Computational Learning
Theory, pages 139-154, August 1988.

[6] C. Gentile and D.P. Helmbold. Improved lower bounds for learning from noisy examples: an
information-theoretic approach. In COLT, pages 104-115, 1998.

[7] V.N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

[8] Gyora M. Benedek and Alon Itai. Learnability with respect to fixed distributions. Theoretical Computer
Science, 86(2):377-389, September 1991.

[9] S. Ben-David, T. Lu, and D. Pal. Does unlabeled data provably help? In Proceedings of the Twenty-First
Annual Conference on Computational Learning Theory, pages 33-44, 2008.

[10] N. Vayatis and R. Azencott. Distribution-dependent Vapnik-Chervonenkis bounds. In EuroCOLT '99,
pages 230-240, London, UK, 1999. Springer-Verlag.

[11] D.J.H. Garling. Inequalities: A Journey into Linear Analysis. Cambridge University Press, 2007.

[12] V.V. Buldygin and Yu. V. Kozachenko. Metric Characterization of Random Variables and Random
Processes. American Mathematical Society, 1998.

[13] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural
results. In COLT 2001, volume 2111, pages 224-240. Springer, Berlin, 2001.

[14] O. Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of
Learning Algorithms. PhD thesis, Ecole Polytechnique, 2002.

[15] B. Scholkopf, J. Shawe-Taylor, A. J. Smola, and R.C. Williamson. Generalization bounds via eigenvalues
of the Gram matrix. Technical Report NC2-TR-1999-035, NeuroCOLT2, 1999.

[16] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University
Press, 1999.

[17] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University
Press, 2000.

[18] Z. Bai and J.W. Silverstein. Spectral Analysis of Large Dimensional Random Matrices. Springer, second
edition, 2010.

[19] M. Rudelson and R. Vershynin. The smallest singular value of a random rectangular matrix.
Communications on Pure and Applied Mathematics, 62:1707-1739, 2009.

[20] M. Rudelson and R. Vershynin. The Littlewood-Offord problem and invertibility of random matrices.
Advances in Mathematics, 218(2):600-633, 2008.

[21] T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine
Learning Research, 2:527-550, 2002.

[22] G. Bennett, V. Goodman, and C. M. Newman. Norms of random matrices. Pacific J. Math., 59(2):359-
365, 1975.

[23] M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, 1991.

[24] F.L. Nazarov and A. Podkorytov. Ball, Haagerup, and distribution functions. Operator Theory: Advances
and Applications, 113 (Complex analysis, operators, and related topics):247-267, 2000.

[25] R.E.A.C. Paley and A. Zygmund. A note on analytic functions in the unit circle. Proceedings of the
Cambridge Philosophical Society, 28:266-272, 1932.






A Proofs for "Tight Sample Complexity of Large-Margin Learning" 
(S. Sabato, N. Srebro and N. Tishby) 

A.1 Proof of Lemma 3.3

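Proof. The inequality $k_\gamma \le k_{\alpha\gamma}$ is trivial from the definition of $k_\gamma$. For the other inequality, note
first that we can always let $E_{X \sim D_X}[XX']$ be diagonal by rotating the axes w.l.o.g. Therefore
$k_\gamma = \min\{k \mid \sum_{i=k+1}^d \lambda_i \le \gamma^2 k\}$. Since $k_\gamma \le k_{\alpha\gamma}$, we have
$\gamma^2 k_\gamma \ge \sum_{i=k_\gamma+1}^d \lambda_i \ge \sum_{i=k_{\alpha\gamma}+1}^d \lambda_i$. In
addition, by the minimality of $k_{\alpha\gamma}$, $\sum_{i=k_{\alpha\gamma}}^d \lambda_i > \alpha^2\gamma^2(k_{\alpha\gamma} - 1)$.
Thus $\sum_{i=k_{\alpha\gamma}+1}^d \lambda_i > \alpha^2\gamma^2(k_{\alpha\gamma} - 1) - \lambda_{k_{\alpha\gamma}}$.
Combining the inequalities we get $\gamma^2 k_\gamma > \alpha^2\gamma^2(k_{\alpha\gamma} - 1) - \lambda_{k_{\alpha\gamma}}$. In addition, if $k_\gamma < k_{\alpha\gamma}$
then $\gamma^2 k_\gamma \ge \sum_{i=k_\gamma+1}^d \lambda_i \ge \lambda_{k_{\alpha\gamma}}$. Thus, either $k_\gamma = k_{\alpha\gamma}$ or $2\gamma^2 k_\gamma > \alpha^2\gamma^2(k_{\alpha\gamma} - 1)$. □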

A.2 Details omitted from the proof of Theorem 4.2

The proof of Theorem 4.2 is complete except for the construction of $\tilde{X}$ and $\tilde{P}$ in the first paragraph,
which is given here in full, using the following lemma:

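Lemma A.1. Let $S = (X_1, \ldots, X_m)$ be a sequence of elements in $\mathbb{R}^d$, and let $X$ be an $m \times d$
matrix whose rows are the elements of $S$. If $S$ is $\gamma$-shattered, then for every $\epsilon > 0$ there is a column
vector $\tilde{r} \in \mathbb{R}^m$ such that for every $y \in \{\pm\gamma\}^m$ there is a $\tilde{w}_y \in B^{d+1}_{1+\epsilon}$ such that $\tilde{X}\tilde{w}_y = y$, where
$\tilde{X} = (X\ \tilde{r})$.

Proof. If $S$ is $\gamma$-shattered then there exists a vector $r \in \mathbb{R}^m$, such that for all $y \in \{\pm 1\}^m$ there exists
$w_y \in B^d_1$ such that for all $i \in [m]$, $y_i(\langle X_i, w_y \rangle - r_i) \ge \gamma$. For $\epsilon > 0$ define $\tilde{w}_y = (w_y, \sqrt{\epsilon}) \in B^{d+1}_{1+\epsilon}$,
and $\tilde{r} = -r/\sqrt{\epsilon}$, and let $\tilde{X} = (X\ \tilde{r})$. For every $y \in \{\pm 1\}^m$ there is a vector $t_y \in \mathbb{R}^m$ such
that $\forall i \in [m]$, $t_y[i]y[i] \ge 1$, and $\frac{1}{\gamma}\tilde{X}\tilde{w}_y = t_y$. As in the proof of necessity in Theorem 5.2 it
follows that there exists $\hat{w}_y \in B^{d+1}_{1+\epsilon}$ such that $\frac{1}{\gamma}\tilde{X}\hat{w}_y = y$. Scaling $y$ by $\gamma$, we get the claim of the
lemma. □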

Now, let $X$ be an $m \times d$ matrix whose rows are a set of $m$ points in $\mathbb{R}^d$ which is $\gamma$-shattered. By
Lemma A.1, for any $\epsilon > 0$ there exists a matrix $\tilde{X}$ of dimensions $m \times (d + 1)$ such that the first $d$
columns of $\tilde{X}$ are the respective columns of $X$, and for all $y \in \{\pm\gamma\}^m$, there is a $w_y \in B^{d+1}_{1+\epsilon}$
such that $\tilde{X}w_y = y$. Since $X$ is $(B^2, k)$-limited, there exists an orthogonal projection matrix $P$ of
size $d \times d$ and rank $d - k$ such that $\forall i \in [m]$, $\|X_i'P\|^2 \le B^2$. Let $\tilde{P}$ be the embedding of $P$ in a
$(d + 1) \times (d + 1)$ zero matrix, so that $\tilde{P}$ is of the same rank and projects onto the same subspace.
The rest of the proof follows as in the body of the paper.

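A.3 Proof of Theorem 4.4

Proof of Theorem 4.4. Let $\Sigma = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ be the covariance matrix of $D_X$, where $\forall i \in
[d - 1]$, $\lambda_i \ge \lambda_{i+1}$. Define $\mathcal{X}_\alpha = \{x \in \mathbb{R}^d \mid \sum_{i=k_\gamma(D_X)+1}^d x[i]^2 \le \alpha\}$.

Let $\{x_i\}_{i=1}^m$ be an i.i.d. sample of size $m$ drawn from $D_X$. We will select $\alpha$ such that the
probability that the whole sample is contained in $\mathcal{X}_\alpha$ is large. $P[\forall i \in [m], x_i \in \mathcal{X}_\alpha] = (1 - P[x_i \notin
\mathcal{X}_\alpha])^m$. Let $X \sim D_X$. Then for all $t > 0$,

$$P[X \notin \mathcal{X}_\alpha] = P\Big[\sum_{i=k_\gamma+1}^d X[i]^2 \ge \alpha\Big] \le E\Big[\exp\Big(t \sum_{i=k_\gamma+1}^d X[i]^2\Big)\Big] \exp(-t\alpha).$$

Let $\lambda_{\max} = \lambda_{k_\gamma+1}$. Define $Y \in \mathbb{R}^d$ such that $Y[i] = X[i]\sqrt{\lambda_{\max}/\lambda_i}$. Then $\sum_{i=k_\gamma+1}^d X[i]^2 =
\sum_{i=k_\gamma+1}^d \frac{\lambda_i}{\lambda_{\max}} Y[i]^2$, and by the definition of $k_\gamma$, $\sum_{i=k_\gamma+1}^d \frac{\lambda_i}{\lambda_{\max}} \le \frac{k_\gamma}{\lambda_{\max}}$. Thus, by Lemma A.2,

$$E\Big[\exp\Big(t \sum_{i=k_\gamma+1}^d X[i]^2\Big)\Big] \le \max_i \big(E[\exp(3tY[i]^2)]\big)^{\lceil k_\gamma/\lambda_{\max} \rceil}.$$

For every $i$, $Y[i]$ is a sub-Gaussian random variable with moment $B = \rho\sqrt{\lambda_{\max}}$. By [12], Lemma
1.1.6, $E[\exp(3tY[i]^2)] \le (1 - 6\rho^2\lambda_{\max}t)^{-1/2}$, for $t \in (0, (6\rho^2\lambda_{\max})^{-1})$. Setting $t = \frac{1}{12\rho^2\lambda_{\max}}$,

$$P[X \notin \mathcal{X}_\alpha] \le 2^{k_\gamma/\lambda_{\max}} \exp\Big(-\frac{\alpha}{12\rho^2\lambda_{\max}}\Big).$$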



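Thus there is a constant $C$ such that for $\alpha(\gamma) = C \cdot \rho^2(k_\gamma(D_X) + \lambda_{\max} \ln\frac{m}{\delta})$, $P[X \notin \mathcal{X}_{\alpha(\gamma)}] \le
\frac{\delta}{2m}$. Clearly, $\lambda_{\max} \le k_\gamma(D_X)$, and $k_\gamma(\mathcal{X}_{\alpha(\gamma)}) \le \alpha(\gamma)$. Therefore, from Theorem 4.2, the
$\gamma$-fat-shattering dimension of $\mathcal{W}(\mathcal{X}_{\alpha(\gamma)})$ is $O(\rho^2 k_\gamma(D_X) \ln\frac{m}{\delta})$. Define $D_\gamma$ to be the distribution
such that $P_{D_\gamma}[(X, Y)] = P_D[(X, Y) \mid X \in \mathcal{X}_{\alpha(\gamma)}]$. By standard sample complexity bounds [16],
for any distribution $D$ over $\mathbb{R}^d \times \{\pm 1\}$, with probability at least $1 - \frac{\delta}{2}$ over samples, $\ell_m(\mathcal{A}, D) \le
O\Big(\sqrt{\frac{F(\gamma/8, D) \ln\frac{m}{\delta}}{m}}\Big)$, where $F(\gamma, D)$ is the $\gamma$-fat-shattering dimension of the class of linear functions
with domain restricted to the support of $D$ in $\mathbb{R}^d$. Consider $D_{\gamma/8}$. Since the support of $D_{\gamma/8}$ is
$\mathcal{X}_{\alpha(\gamma/8)}$, $F(\gamma/8, D_{\gamma/8}) \le O(\rho^2 k_{\gamma/8}(D_X) \ln\frac{m}{\delta})$. With probability $1 - \delta$ over samples from $D_X$,
the sample is drawn from $D_{\gamma/8}$. In addition, the probability of the unlabeled example to be drawn
from $\mathcal{X}_{\alpha(\gamma/8)}$ is larger than $1 - \frac{\delta}{m}$. Therefore $\ell_m(\mathcal{A}, D) \le \tilde{O}\Big(\sqrt{\frac{\rho^2 k_{\gamma/8}(D_X)}{m}}\Big)$. Setting $\delta = \epsilon/2$
and bounding the expected error, we get $m(\epsilon, \gamma, D) \le \tilde{O}\Big(\frac{\rho^2 k_{\gamma/8}(D_X)}{\epsilon^2}\Big)$. Lemma 3.3 allows replacing
$k_{\gamma/8}$ with $O(k_\gamma)$. □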

Lemma A.2. Let $T_1, \ldots, T_d$ be independent random variables such that all the moments $E[T_i^n]$ for
all $i$ are non-negative. Let $\lambda_1, \ldots, \lambda_d$ be real coefficients such that $\sum_{i=1}^d \lambda_i = L$, and $\lambda_i \in [0, 1]$
for $i \in [d]$. Then for all $t \ge 0$

$$E\Big[\exp\Big(t \sum_{i=1}^d \lambda_i T_i\Big)\Big] \le \max_{i \in [d]} \big(E[\exp(3tT_i)]\big)^{\lceil L \rceil}.$$
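Proof. Let $T_1, \ldots, T_d$ be independent random variables. Then, by Jensen's inequality,

$$E\Big[\exp\Big(t \sum_{i=1}^d \lambda_i T_i\Big)\Big] = \prod_{i=1}^d E[\exp(t\lambda_i T_i)] \le \prod_{i=1}^d E\Big[\exp\Big(tT_i \sum_{j=1}^d \lambda_j\Big)\Big]^{\lambda_i / \sum_{j=1}^d \lambda_j} \le \max_{i \in [d]} E\Big[\exp\Big(tT_i \sum_{j=1}^d \lambda_j\Big)\Big].$$

Now, consider a partition $Z_1, \ldots, Z_k$ of $[d]$, and denote $L_j = \sum_{i \in Z_j} \lambda_i$. Then by the inequality
above,

$$E\Big[\exp\Big(t \sum_{i=1}^d \lambda_i T_i\Big)\Big] = \prod_{j=1}^k E\Big[\exp\Big(t \sum_{i \in Z_j} \lambda_i T_i\Big)\Big] \le \prod_{j=1}^k \max_{i \in Z_j} E[\exp(tT_i L_j)].$$

Let the partition be such that for all $j \in [k]$, $L_j \le 1$. There exists such a partition such that $L_j < \frac{1}{2}$
for no more than one $j$. Therefore, for this partition $L = \sum_{i=1}^d \lambda_i = \sum_{j \in [k]} L_j \ge \frac{1}{2}(k - 1)$. Thus
$k \le 2L + 1$.

Now, consider $E[\exp(tT_i L_j)]$ for some $i$ and $j$. For any random variable $X$,

$$E[\exp(tX)] = \sum_{n=0}^\infty \frac{t^n E[X^n]}{n!}.$$

Therefore, $E[\exp(tT_i L_j)] = \sum_{n=0}^\infty \frac{t^n L_j^n E[T_i^n]}{n!}$. Since $E[T_i^n] \ge 0$ for all $n$, and $L_j \le 1$, it follows
that $E[\exp(tT_i L_j)] \le E[\exp(tT_i)]$. Thus

$$E\Big[\exp\Big(t \sum_{i=1}^d \lambda_i T_i\Big)\Big] \le \prod_{j=1}^k \max_{i \in Z_j} E[\exp(tT_i)] \le \max_{i \in [d]} E\Big[\exp\Big(t \sum_{j=1}^k T_i[j]\Big)\Big],$$

where $T_i[j]$ are independent copies of $T_i$.

It is easy to see that $E[\exp(\frac{1}{a} \sum_{i=1}^a X_i)] \le E[\exp(\frac{1}{b} \sum_{i=1}^b X_i)]$, for $a \ge b$ and $X_1, \ldots, X_a$ i.i.d.
random variables. Since $k \ge \lceil L \rceil$ it follows that

$$E\Big[\exp\Big(t \sum_{i=1}^d \lambda_i T_i\Big)\Big] \le \max_{i \in [d]} E\Big[\exp\Big(t \sum_{j=1}^k T_i[j]\Big)\Big] \le \max_{i \in [d]} E\Big[\exp\Big(t \frac{k}{\lceil L \rceil} \sum_{j=1}^{\lceil L \rceil} T_i[j]\Big)\Big].$$

Since $k \le 2L + 1$ and all the moments of $T_i[j]$ are non-negative, it follows that

$$E\Big[\exp\Big(t \sum_{i=1}^d \lambda_i T_i\Big)\Big] \le \max_{i \in [d]} E\Big[\exp\Big(t\Big(2 + \frac{1}{\lceil L \rceil}\Big) \sum_{j=1}^{\lceil L \rceil} T_i[j]\Big)\Big].$$

□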
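
As a sanity check of Lemma A.2, the inequality can be evaluated in closed form for one specific choice of variables. The sketch below is an illustration under stated assumptions and is not part of the proof: it takes each $T_i = Z_i^2$ with $Z_i$ standard Gaussian, for which $E[\exp(sT_i)] = (1 - 2s)^{-1/2}$ when $s < 1/2$ and all moments are non-negative. The function names are hypothetical.

```python
import math


def mgf_chi2(s):
    """E[exp(s * Z^2)] for Z ~ N(0, 1); finite only for s < 1/2."""
    if s >= 0.5:
        raise ValueError("moment generating function is infinite for s >= 1/2")
    return (1.0 - 2.0 * s) ** -0.5


def lhs(coeffs, t):
    """E[exp(t * sum_i lambda_i T_i)] for independent T_i = Z_i^2."""
    return math.prod(mgf_chi2(t * c) for c in coeffs)


def rhs(coeffs, t):
    """max_i (E[exp(3 t T_i)])^ceil(L); all T_i share one distribution here."""
    return mgf_chi2(3.0 * t) ** math.ceil(sum(coeffs))


coeffs = [1.0, 0.5, 0.5, 0.25, 0.75]  # each in [0, 1], summing to L = 3
for t in (0.01, 0.05, 0.1, 0.15):     # need 3t < 1/2
    print(t, lhs(coeffs, t), rhs(coeffs, t))
```

In this example the right-hand side dominates with a wide margin, which is expected since the factor 3 in the lemma is not tuned.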



A.4 Proof of Theorem 5.2

The proof uses the following lemma, which allows converting the representation of the Gram-matrix to a
different feature space while keeping the separation properties intact. For a matrix $M$, $M^+$ denotes its
pseudo-inverse. If $(M'M)$ is invertible then $M^+ = (M'M)^{-1}M'$.

Lemma A.3. Let $X$ be an $m \times d$ matrix such that $XX'$ is invertible, and $Y$ such that $XX' = YY'$.
Let $r \in \mathbb{R}^m$ be some real vector. If there exists a vector $\tilde{w}$ such that $Y\tilde{w} = r$, then there exists
a vector $w$ such that $Xw = r$ and $\|w\| = \|P\tilde{w}\|$, where $P = Y'Y'^{+} = Y'(YY')^{-1}Y$ is the
projection matrix onto the sub-space spanned by the rows of $Y$.

Proof. Denote $K = XX' = YY'$. Set $T = Y'X'^{+} = Y'K^{-1}X$. Set $w = T'\tilde{w}$. We have
$Xw = XT'\tilde{w} = XX'K^{-1}Y\tilde{w} = Y\tilde{w} = r$. In addition, $\|w\|^2 = w'w = \tilde{w}'TT'\tilde{w}$. By definition
of $T$, $TT' = Y'X'^{+}X^{+}Y = Y'K^{+}Y = Y'K^{-1}Y = Y'(YY')^{-1}Y = Y'Y'^{+} = P$. Since $P$
is a projection matrix, we have $P^2 = P$. In addition, $P = P'$. Therefore $TT' = PP'$, and so
$\|w\|^2 = \tilde{w}'PP'\tilde{w} = \|P\tilde{w}\|^2$. □



The next lemma will allow us to prove that if a set is shattered at the origin, it can be separated with 
the exact margin. 

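Lemma A.4. Let $R = \{r_y \in \mathbb{R}^m \mid y \in \{\pm 1\}^m\}$ such that for all $y \in \{\pm 1\}^m$ and for all $i \in [m]$,
$r_y[i]y[i] \ge 1$. Then $\forall y \in \{\pm 1\}^m$, $y \in \mathrm{conv}(R)$.

Proof. We will prove the claim by induction on the dimension $m$.

Induction base: For $m = 1$, we have $R = \{(a), (b)\}$ where $a \le -1$ and $b \ge 1$. Clearly, $\mathrm{conv}(R) =
[a, b]$, and the two one-dimensional vectors $(+1)$ and $(-1)$ are in $[a, b]$.

Induction step: For a vector $t = (t[1], \ldots, t[m]) \in \mathbb{R}^m$, denote by $\bar{t}$ its projection $(t[1], \ldots, t[m -
1])$ on $\mathbb{R}^{m-1}$. Similarly, for a set of vectors $S \subseteq \mathbb{R}^m$, let $\bar{S} = \{\bar{s} \mid s \in S\} \subseteq \mathbb{R}^{m-1}$. Define

$$Y_+ = \{y \in \{\pm 1\}^m \mid y[m] = +1\}$$

$$Y_- = \{y \in \{\pm 1\}^m \mid y[m] = -1\}.$$

Let $R_+ = \{r_y \mid y \in Y_+\}$, and similarly for $R_-$. Then $\bar{R}_+$ and $\bar{R}_-$ satisfy the assumptions for $R$
when $m - 1$ is substituted for $m$.

Let $y^* \in \{\pm 1\}^m$. We wish to prove $y^* \in \mathrm{conv}(R)$. From the induction hypothesis we have
$\bar{y}^* \in \mathrm{conv}(\bar{R}_+)$ and $\bar{y}^* \in \mathrm{conv}(\bar{R}_-)$. Thus

$$\bar{y}^* = \sum_{y \in Y_+} \alpha_y \bar{r}_y = \sum_{y \in Y_-} \beta_y \bar{r}_y,$$

where $\alpha_y, \beta_y \ge 0$, $\sum_{y \in Y_+} \alpha_y = 1$, and $\sum_{y \in Y_-} \beta_y = 1$. Let $y^*_a = \sum_{y \in Y_+} \alpha_y r_y$ and $y^*_b =
\sum_{y \in Y_-} \beta_y r_y$. We have that $\forall y \in Y_+$, $r_y[m] \ge 1$, and $\forall y \in Y_-$, $r_y[m] \le -1$. Therefore, $y^*_a[m] \ge 1$
and $y^*_b[m] \le -1$. In addition, $\bar{y}^*_a = \bar{y}^*_b = \bar{y}^*$. Hence there is $\gamma \in [0, 1]$ such that $y^* = \gamma y^*_a + (1 - \gamma)y^*_b$.
Since $y^*_a \in \mathrm{conv}(R_+)$ and $y^*_b \in \mathrm{conv}(R_-)$, we have $y^* \in \mathrm{conv}(R)$. □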
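
The induction in the proof of Lemma A.4 is constructive, and the convex weights can be computed by following it step by step. The sketch below is a hypothetical illustration (the names `convex_weights`, `example_R` and `combine` are not from the paper): `R` maps each sign vector $y$ to a vector $r_y$ with $r_y[i]y[i] \ge 1$, and the recursion splits on the last coordinate exactly as $Y_+$ and $Y_-$ do above.

```python
from itertools import product


def convex_weights(R, target):
    """Return weights w_y >= 0 summing to 1 with sum_y w_y * R[y] == target,
    where target is a +-1 vector and R[y][i] * y[i] >= 1 for all y, i."""
    m = len(target)
    if m == 1:
        a, b = R[(-1,)][0], R[(1,)][0]        # a <= -1 and b >= 1
        theta = (b - target[0]) / (b - a)     # theta * a + (1 - theta) * b = target
        return {(-1,): theta, (1,): 1.0 - theta}
    halves = {}
    for sign in (1, -1):
        # Project the vectors whose label ends in `sign` onto the first m-1 coordinates.
        sub = {y[:-1]: r[:-1] for y, r in R.items() if y[-1] == sign}
        w = convex_weights(sub, target[:-1])
        # Last coordinate of the resulting convex combination (>= 1 or <= -1).
        last = sum(w[y[:-1]] * r[-1] for y, r in R.items() if y[-1] == sign)
        halves[sign] = (w, last)
    (alpha, a_m), (beta, b_m) = halves[1], halves[-1]
    g = (target[-1] - b_m) / (a_m - b_m)      # mixing weight, lies in [0, 1]
    weights = {y + (1,): g * wy for y, wy in alpha.items()}
    weights.update({y + (-1,): (1.0 - g) * wy for y, wy in beta.items()})
    return weights


def example_R(m):
    """Assumed test family: r_y = y scaled coordinate-wise by factors >= 1."""
    return {y: tuple(yi * (1.0 + 0.5 * idx + 0.25 * i) for i, yi in enumerate(y))
            for idx, y in enumerate(product((-1, 1), repeat=m))}


def combine(R, weights):
    m = len(next(iter(R)))
    return tuple(sum(weights[y] * R[y][i] for y in R) for i in range(m))


R = example_R(3)
w = convex_weights(R, (1, -1, 1))
print(combine(R, w))
```

The mixing weight `g` plays the role of $\gamma \in [0, 1]$ in the last step of the proof.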
Proof of Theorem 5.2. Let $XX' = U\Lambda U'$ be the SVD of $XX'$, where $U$ is an orthogonal matrix
and $\Lambda$ is a diagonal matrix. Let $Y = U\Lambda^{1/2}$. We have $XX' = YY'$. We show that the conditions are
sufficient and necessary for the shattering of $S$.

Sufficient: Assume $XX'$ is invertible. Then $\Lambda$ is invertible, thus $Y$ is invertible. For any $y \in
\{\pm 1\}^m$, let $\tilde{w} = Y^{-1}y$. We have $Y\tilde{w} = y$. In addition, $\|\tilde{w}\|^2 = y'(YY')^{-1}y = y'(XX')^{-1}y \le 1$.
Therefore, by Lemma A.3, there exists a separator $w$ such that $Xw = y$ and $\|w\| = \|P\tilde{w}\| = \|\tilde{w}\| \le 1$.

Necessary: If $XX'$ is not invertible then the vectors in $S$ are linearly dependent, thus by standard
VC-theory [16] $S$ cannot be shattered using linear separators. The first condition is therefore
necessary. We assume $S$ is 1-shattered at the origin and show that the second condition necessarily
holds. Let $L = \{r \mid \exists w \in B^d_1, Xw = r\}$. Since $S$ is shattered, for any $y \in \{\pm 1\}^m$ there exists
$r_y \in L$ such that $\forall i \in [m]$, $r_y[i]y[i] \ge 1$. By Lemma A.4, $\forall y \in \{\pm 1\}^m$, $y \in \mathrm{conv}(R)$ where
$R = \{r_y \mid y \in \{\pm 1\}^m\}$. Since $L$ is convex and $R \subseteq L$, $\mathrm{conv}(R) \subseteq L$. Thus for all $y \in \{\pm 1\}^m$,
$y \in L$, that is there exists $w_y \in \mathbb{R}^d$ such that $Xw_y = y$ and $\|w_y\| \le 1$. From Lemma A.3 we thus
have $\tilde{w}_y$ such that $Y\tilde{w}_y = y$ and $\|\tilde{w}_y\| = \|Pw_y\| \le \|w_y\| \le 1$. $Y$ is invertible, hence $\tilde{w}_y = Y^{-1}y$.
Thus $y'(XX')^{-1}y = y'(YY')^{-1}y = \|\tilde{w}_y\|^2 \le 1$. □

A.5 Proof of Theorem 6.2

In the proof of Theorem 6.2 we use the fact $\lambda_m(XX') = \inf_{\|x\|_2 = 1} \|X'x\|^2$ and bound the right-
hand side via an $\epsilon$-net of the unit sphere in $\mathbb{R}^m$, denoted by $S^{m-1} = \{x \in \mathbb{R}^m \mid \|x\|_2 = 1\}$. An
$\epsilon$-net of the unit sphere is a set $C \subseteq S^{m-1}$ such that $\forall x \in S^{m-1}, \exists x' \in C, \|x - x'\| \le \epsilon$. Denote
the minimal size of an $\epsilon$-net for $S^{m-1}$ by $\mathcal{N}_m(\epsilon)$, and by $\mathcal{C}_m(\epsilon)$ a minimal $\epsilon$-net of $S^{m-1}$, so that
$\mathcal{C}_m(\epsilon) \subseteq S^{m-1}$ and $|\mathcal{C}_m(\epsilon)| = \mathcal{N}_m(\epsilon)$. The proof of Theorem 6.2 requires several lemmas. First
we prove a concentration result for the norm of a matrix defined by sub-Gaussian variables. Then
we bound the probability that the squared norm of a vector is small.

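Lemma A.5. Let $Y$ be a $d \times m$ matrix with $m \le d$, such that $Y_{ij}$ are independent sub-Gaussian
variables with moment $B$. Let $\Sigma$ be a diagonal $d \times d$ PSD matrix such that $\Sigma \le I$. Then for all
$t \ge 0$ and $\epsilon \in (0, 1)$,

$$P[\|\sqrt{\Sigma}Y\| \ge t] \le \mathcal{N}_m(\epsilon) \exp\Big(\frac{\mathrm{tr}(\Sigma)}{2} - \frac{t^2(1 - \epsilon)^2}{4B^2}\Big).$$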

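Proof. We have $\|\sqrt{\Sigma}Y\| \le \max_{x \in \mathcal{C}_m(\epsilon)} \|\sqrt{\Sigma}Yx\|/(1 - \epsilon)$, see for instance in [22]. Therefore,

$$P[\|\sqrt{\Sigma}Y\| \ge t] \le \sum_{x \in \mathcal{C}_m(\epsilon)} P[\|\sqrt{\Sigma}Yx\| \ge (1 - \epsilon)t]. \quad (3)$$

Fix $x \in \mathcal{C}_m(\epsilon)$. Let $V = \sqrt{\Sigma}Yx$, and assume $\Sigma = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$. For $u \in \mathbb{R}^d$,

$$E[\exp(\langle u, V \rangle)] = E\Big[\exp\Big(\sum_{i \in [d]} u_i \sqrt{\lambda_i} \sum_{j \in [m]} Y_{ij}x_j\Big)\Big] = \prod_{j,i} E[\exp(u_i \sqrt{\lambda_i} Y_{ij} x_j)]$$

$$\le \prod_{j,i} \exp(u_i^2 \lambda_i B^2 x_j^2/2) = \exp\Big(\frac{B^2}{2} \sum_{i \in [d]} u_i^2 \lambda_i \sum_{j \in [m]} x_j^2\Big)$$

$$= \exp\Big(\frac{B^2}{2} \sum_{i \in [d]} u_i^2 \lambda_i\Big) = \exp(\langle B^2 \Sigma u, u \rangle/2).$$

Let $s = 1/(4B^2)$. Since $\Sigma \le I$, we have $s \le 1/(4B^2 \max_{i \in [d]} \lambda_i)$. Therefore, by Lemma A.9 (see
Section A.6),

$$E[\exp(s\|V\|^2)] \le \exp(2sB^2 \mathrm{tr}(\Sigma)).$$

By Chernoff's method, $P[\|V\|^2 \ge z^2] \le E[\exp(s\|V\|^2)]/\exp(sz^2)$. Thus

$$P[\|V\|^2 \ge z^2] \le \exp(2sB^2 \mathrm{tr}(\Sigma) - sz^2) = \exp\Big(\frac{\mathrm{tr}(\Sigma)}{2} - \frac{z^2}{4B^2}\Big).$$

Set $z = t(1 - \epsilon)$. Then for all $x \in S^{m-1}$

$$P[\|\sqrt{\Sigma}Yx\| \ge t(1 - \epsilon)] = P[\|V\| \ge t(1 - \epsilon)] \le \exp\Big(\frac{\mathrm{tr}(\Sigma)}{2} - \frac{t^2(1 - \epsilon)^2}{4B^2}\Big).$$

Therefore, by Eq. (3),

$$P[\|\sqrt{\Sigma}Y\| \ge t] \le \mathcal{N}_m(\epsilon) \exp\Big(\frac{\mathrm{tr}(\Sigma)}{2} - \frac{t^2(1 - \epsilon)^2}{4B^2}\Big).$$

□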



Lemma A.6. Let $Y$ be a $d \times m$ matrix with $m \le d$, such that $Y_{ij}$ are independent centered random
variables with variance 1 and fourth moments at most $B$. Let $\Sigma$ be a diagonal $d \times d$ PSD matrix such
that $\Sigma \le I$. There exist $\alpha > 0$ and $\eta \in (0, 1)$ that depend only on $B$ such that for any $x \in S^{m-1}$

$$P[\|\sqrt{\Sigma}Yx\|^2 \le \alpha \cdot (\mathrm{tr}(\Sigma) - 1)] \le \eta^{\mathrm{tr}(\Sigma)}.$$

To prove Lemma A.6 we require Lemma A.7 [20, Lemma 2.2] and Lemma A.8, which extends
Lemma 2.6 in the same work.

Lemma A.7. Let $T_1, \ldots, T_n$ be independent non-negative random variables. Assume that there are
$\theta > 0$ and $\mu \in (0, 1)$ such that for any $i$, $P[T_i \le \theta] \le \mu$. There are $\alpha > 0$ and $\eta \in (0, 1)$ that
depend only on $\theta$ and $\mu$ such that $P[\sum_{i=1}^n T_i < \alpha n] \le \eta^n$.

Lemma A.8. Let $Y$ be a $d \times m$ matrix with $m \le d$, such that the columns of $Y$ are i.i.d. random
vectors. Assume further that $Y_{ij}$ are centered, and have a variance of 1 and a fourth moment at most
$B$. Let $\Sigma$ be a diagonal $d \times d$ PSD matrix. Then for all $x \in S^{m-1}$, $P[\|\sqrt{\Sigma}Yx\| \le \sqrt{\mathrm{tr}(\Sigma)/2}] \le
1 - 1/(196B)$.

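Proof. Let $x \in S^{m-1}$, and $T_i = (\sum_{j=1}^m Y_{ij}x_j)^2$. Let $\lambda_1, \ldots, \lambda_d$ be the values on the diagonal of
$\Sigma$, and let $T_\Sigma = \|\sqrt{\Sigma}Yx\|^2 = \sum_{i=1}^d \lambda_i T_i$. First, since $E[Y_{ij}] = 0$ and $E[Y_{ij}^2] = 1$ for all $i, j$, we
have $E[T_i] = \sum_{j \in [m]} x_j^2 E[Y_{ij}^2] = \|x\|^2 = 1$. Therefore $E[T_\Sigma] = \mathrm{tr}(\Sigma)$. Second, since $Y_{i1}, \ldots, Y_{im}$
are independent and centered, we have [23, Lemma 6.3]

$$E[T_i^2] = E\Big[\Big(\sum_{j \in [m]} Y_{ij}x_j\Big)^4\Big] \le 16 E_\sigma\Big[\Big(\sum_{j \in [m]} \sigma_j Y_{ij}x_j\Big)^4\Big],$$

where $\sigma_1, \ldots, \sigma_m$ are independent uniform $\{\pm 1\}$ variables. Now, by Khinchine's inequality [24],

$$E_\sigma\Big[\Big(\sum_{j \in [m]} \sigma_j Y_{ij}x_j\Big)^4\Big] \le 3E\Big[\Big(\sum_{j \in [m]} Y_{ij}^2 x_j^2\Big)^2\Big] = 3 \sum_{j,k \in [m]} x_j^2 x_k^2 E[Y_{ij}^2 Y_{ik}^2].$$

Now $E[Y_{ij}^2 Y_{ik}^2] \le \sqrt{E[Y_{ij}^4]E[Y_{ik}^4]} \le B$. Thus $E[T_i^2] \le 48B \sum_{j,k \in [m]} x_j^2 x_k^2 = 48B\|x\|^4 =
48B$. Thus,

$$E[T_\Sigma^2] = E\Big[\Big(\sum_{i=1}^d \lambda_i T_i\Big)^2\Big] = \sum_{i,j=1}^d \lambda_i \lambda_j E[T_i T_j]$$

$$\le \sum_{i,j=1}^d \lambda_i \lambda_j \sqrt{E[T_i^2]E[T_j^2]} \le 48B\Big(\sum_{i=1}^d \lambda_i\Big)^2 = 48B \cdot \mathrm{tr}(\Sigma)^2.$$

By the Paley-Zygmund inequality [25], for $\theta \in [0, 1]$

$$P[T_\Sigma \ge \theta E[T_\Sigma]] \ge (1 - \theta)^2 \frac{E[T_\Sigma]^2}{E[T_\Sigma^2]} \ge \frac{(1 - \theta)^2}{48B}.$$

Therefore, setting $\theta = 1/2$, we get $P[T_\Sigma \le \mathrm{tr}(\Sigma)/2] \le 1 - 1/(196B)$. □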
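
The Paley-Zygmund step can be illustrated on a distribution where both sides are available in closed form. The sketch below is an assumption-laden example unrelated to the variables $T_\Sigma$ of the proof: it takes $T$ exponential with rate 1, so that $E[T] = 1$, $E[T^2] = 2$ and $P[T \ge \theta] = e^{-\theta}$. The function names are hypothetical.

```python
import math


def tail_probability(theta):
    """P[T >= theta * E[T]] for T ~ Exponential(1), where E[T] = 1."""
    return math.exp(-theta)


def paley_zygmund_bound(theta, first_moment=1.0, second_moment=2.0):
    """Lower bound (1 - theta)^2 * E[T]^2 / E[T^2] for a non-negative T."""
    return (1.0 - theta) ** 2 * first_moment ** 2 / second_moment


for theta in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(theta, tail_probability(theta), paley_zygmund_bound(theta))
```

At $\theta = 1/2$ the bound gives $1/8$, while the true tail probability is about $0.61$; the inequality trades tightness for requiring only two moments.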

Proof of Lemma A.6. Let $\lambda_1, \ldots, \lambda_d \in [0, 1]$ be the values on the diagonal of $\Sigma$. Consider a partition
$Z_1, \ldots, Z_k$ of $[d]$, and denote $L_j = \sum_{i \in Z_j} \lambda_i$. There exists such a partition such that for all $j \in [k]$,
$L_j \le 1$, and for all $j \in [k - 1]$, $L_j > \frac{1}{2}$. Let $\Sigma[j]$ be the sub-matrix of $\Sigma$ that includes the rows
and columns whose indexes are in $Z_j$. Let $Y[j]$ be the sub-matrix of $Y$ that includes the rows in $Z_j$.
Denote $T_j = \|\sqrt{\Sigma[j]}Y[j]x\|^2$. Then

$$\|\sqrt{\Sigma}Yx\|^2 = \sum_{j \in [k]} \sum_{i \in Z_j} \lambda_i \Big(\sum_{l=1}^m Y_{il}x_l\Big)^2 = \sum_{j \in [k]} T_j.$$

We have $\mathrm{tr}(\Sigma) = \sum_{i=1}^d \lambda_i \ge \sum_{j \in [k-1]} L_j \ge \frac{1}{2}(k - 1)$. In addition, $L_j \le 1$ for all $j \in [k]$. Thus
$\mathrm{tr}(\Sigma) \le k \le 2\mathrm{tr}(\Sigma) + 1$. For all $j \in [k - 1]$, $L_j \ge \frac{1}{2}$, thus by Lemma A.8, $P[T_j \le 1/4] \le
1 - 1/(196B)$. Therefore, by Lemma A.7 there are $\alpha > 0$ and $\eta \in (0, 1)$ that depend only on $B$
such that
P[|| Vsrirll 2 < a ■ (tr(S) - 1)] < P[||\/Era;|| 2 < a(k - 1)] 

= P[^ Tj < a{k - 1)] < P[ T i < a ( k - !)] ^ ^ ^ ^ 2 "" (S) - 

ie[fc] je[k-i] 

The lemma follows by substituting 77 for 77 2 . □ 

Proof of Theorem 6.2. We have

$$\sqrt{\lambda_m(XX')} = \inf_{x \in S^{m-1}} \|X'x\| \ge \min_{x \in \mathcal{C}_m(\epsilon)} \|X'x\| - \epsilon\|X'\|. \quad (4)$$

For brevity, denote $L = \mathrm{tr}(\Sigma)$. Assume $L \ge 2$. Let $m \le L \cdot \min(1, (c - K\epsilon)^2)$ where $c, K, \epsilon$ are
constants that will be set later such that $c - K\epsilon > 0$. By Eq. (4),

$$P[\lambda_m(XX') \le m] \le P[\lambda_m(XX') \le (c - K\epsilon)^2 L]$$

$$\le P\Big[\min_{x \in \mathcal{C}_m(\epsilon)} \|X'x\| - \epsilon\|X'\| \le (c - K\epsilon)\sqrt{L}\Big] \quad (5)$$

$$\le P[\|X'\| \ge K\sqrt{L}] + P\Big[\min_{x \in \mathcal{C}_m(\epsilon)} \|X'x\| \le c\sqrt{L}\Big]. \quad (6)$$

The last inequality holds since the inequality in line (5) implies at least one of the inequalities in
line (6). We will now upper-bound each of the terms in line (6). We assume w.l.o.g. that $\Sigma$ is not
singular (since zero rows and columns can be removed from $X$ without changing $\lambda_m(XX')$). Define
$Y = \sqrt{\Sigma^{-1}}X'$. Note that $Y_{ij}$ are independent sub-Gaussian variables with (absolute) moment $\rho$. To
bound the first term in line (6), note that by Lemma A.5, for any $K > 0$,

$$P[\|X'\| \ge K\sqrt{L}] = P[\|\sqrt{\Sigma}Y\| \ge K\sqrt{L}] \le \mathcal{N}_m(\tfrac{1}{2}) \exp\Big(L\Big(\frac{1}{2} - \frac{K^2}{16\rho^2}\Big)\Big).$$

By [19], Proposition 2.1, for all $\epsilon \in [0, 1]$, $\mathcal{N}_m(\epsilon) \le 2m(1 + \frac{2}{\epsilon})^{m-1}$. Therefore

$$P[\|X'\| \ge K\sqrt{L}] \le 2m5^{m-1} \exp\Big(L\Big(\frac{1}{2} - \frac{K^2}{16\rho^2}\Big)\Big).$$

Let $K^2 = 16\rho^2(\frac{3}{2} + \ln(5) + \ln(2/\delta))$. Recall that by assumption $m \le L$, and $L \ge 2$. Therefore

$$P[\|X'\| \ge K\sqrt{L}] \le 2m5^{m-1} \exp(-L(1 + \ln(5) + \ln(2/\delta)))$$

$$\le 2L5^{L-1} \exp(-L(1 + \ln(5) + \ln(2/\delta))).$$

Since $L \ge 2$, we have $2L\exp(-L) \le 1$. Therefore

$$P[\|X'\| \ge K\sqrt{L}] \le 2L\exp(-L - \ln(2/\delta)) \le \exp(-\ln(2/\delta)) = \frac{\delta}{2}. \quad (7)$$
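
The arithmetic leading to Eq. (7) can be checked numerically. The following sketch is an illustration only: it evaluates the logarithm of the bound $2m5^{m-1}\exp(L(\frac{1}{2} - \frac{K^2}{16\rho^2}))$ with $K^2 = 16\rho^2(\frac{3}{2} + \ln 5 + \ln(2/\delta))$ and compares it with $\ln(\delta/2)$ over a grid of assumed values of $L$, $m$ and $\delta$. The function name is hypothetical.

```python
import math


def log_first_term_bound(m, L, delta, rho=1.0):
    """log of 2m * 5^(m-1) * exp(L * (1/2 - K^2 / (16 rho^2)))
    with K^2 = 16 rho^2 (3/2 + ln 5 + ln(2/delta))."""
    k_squared = 16.0 * rho ** 2 * (1.5 + math.log(5.0) + math.log(2.0 / delta))
    exponent = L * (0.5 - k_squared / (16.0 * rho ** 2))
    return math.log(2.0 * m) + (m - 1) * math.log(5.0) + exponent


worst_gap = -math.inf
for L in (2.0, 2.5, 10.0, 100.0):          # requires L >= 2
    for m in range(1, int(L) + 1):         # requires m <= L
        for delta in (0.5, 0.1, 1e-3):
            gap = log_first_term_bound(m, L, delta) - math.log(delta / 2.0)
            worst_gap = max(worst_gap, gap)
print(worst_gap)  # non-positive whenever the bound is at most delta / 2
```

Working in log space avoids overflow of $5^{m-1}$ for large $m$.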

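To bound the second term in line (6), since $Y_{ij}$ are sub-Gaussian with moment $\rho$, $E[Y_{ij}^4] \le 5\rho^4$ [12,
Lemma 1.4]. Thus, by Lemma A.6, there are $\alpha > 0$ and $\eta \in (0, 1)$ that depend only on $\rho$ such
that for all $x \in S^{m-1}$, $P[\|\sqrt{\Sigma}Yx\|^2 \le \alpha(L - 1)] \le \eta^L$. Set $c = \sqrt{\alpha/2}$. Since $L \ge 2$, we have
$c\sqrt{L} \le \sqrt{\alpha(L - 1)}$. Thus

$$P\Big[\min_{x \in \mathcal{C}_m(\epsilon)} \|X'x\| \le c\sqrt{L}\Big] \le \sum_{x \in \mathcal{C}_m(\epsilon)} P[\|X'x\| \le c\sqrt{L}]$$

$$\le \sum_{x \in \mathcal{C}_m(\epsilon)} P[\|\sqrt{\Sigma}Yx\| \le \sqrt{\alpha(L - 1)}] \le \mathcal{N}_m(\epsilon)\eta^L.$$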
x£C m (e) 
Let $\epsilon = c/(2K)$, so that $c - K\epsilon > 0$. Let $\theta = \min\Big(\frac{1}{2}, \frac{\ln(1/\eta)}{2\ln(1 + 2/\epsilon)}\Big)$. Set $L_0$ such that $\forall L \ge L_0$,
$L \ge \frac{2(\ln(2/\delta) + \ln(L))}{\ln(1/\eta)}$. For $L \ge L_0$ and $m \le \theta L \le L/2$,

$$\mathcal{N}_m(\epsilon)\eta^L \le 2m(1 + 2/\epsilon)^{m-1}\eta^L$$

$$\le L\exp(L(\theta\ln(1 + 2/\epsilon) - \ln(1/\eta)))$$

$$= \exp(\ln(L) + L(\theta\ln(1 + 2/\epsilon) - \ln(1/\eta)/2) - L\ln(1/\eta)/2)$$

$$\le \exp(L(\theta\ln(1 + 2/\epsilon) - \ln(1/\eta)/2) + \ln(\delta/2)) \quad (8)$$

$$\le \exp(\ln(\delta/2)) = \frac{\delta}{2}. \quad (9)$$

Line (8) follows from $L \ge L_0$, and line (9) follows from $\theta\ln(1 + 2/\epsilon) - \ln(1/\eta)/2 \le 0$. Set
$\beta = \min\{(c - K\epsilon)^2, 1, \theta\}$. Combining Eq. (6), Eq. (7) and Eq. (9) we have that if $L \ge \bar{L} =
\max(L_0, 2)$, then $P[\lambda_m(XX') \le m] \le \delta$ for all $m \le \beta L$. Specifically, this holds for all $L \ge 0$ and
for all $m \le \beta(L - \bar{L})$. Letting $C = \beta\bar{L}$ and substituting $\delta$ for $1 - \delta$ we get the statement of the
theorem. □

A.6 Lemma A.9

Lemma A.9. Let $X \in \mathbb{R}^d$ be a random vector and let $B$ be a PSD matrix such that for all $u \in \mathbb{R}^d$,

$$E[\exp(\langle u, X \rangle)] \le \exp(\langle Bu, u \rangle/2).$$

Then for all $t \in (0, \frac{1}{4\lambda_{\max}(B)}]$, $E[\exp(t\|X\|^2)] \le \exp(2t \cdot \mathrm{trace}(B))$.

Proof of Lemma A.9. It suffices to consider diagonal moment matrices: If $B$ is not diagonal, let
$V \in \mathbb{R}^{d \times d}$ be an orthogonal matrix such that $VBV'$ is diagonal, and let $Y = VX$. We have
$E[\exp(t\|Y\|^2)] = E[\exp(t\|X\|^2)]$ and $\mathrm{tr}(VBV') = \mathrm{tr}(B)$. In addition, for all $u \in \mathbb{R}^d$,

$$E[\exp(\langle u, Y \rangle)] = E[\exp(\langle V'u, X \rangle)] \le \exp(\tfrac{1}{2}\langle BV'u, V'u \rangle) = \exp(\tfrac{1}{2}\langle VBV'u, u \rangle).$$

Thus assume w.l.o.g. that $B = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ where $\lambda_1 \ge \ldots \ge \lambda_d \ge 0$.

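We have $\exp(t\|X\|^2) = \prod_{i \in [d]} \exp(tX[i]^2)$. In addition, for any $t > 0$ and $x \in \mathbb{R}$,
$2\sqrt{\pi t} \cdot \exp(tx^2) = \int_{-\infty}^{\infty} \exp(sx - \frac{s^2}{4t})\,ds$. Therefore, for any $u \in \mathbb{R}^d$,

$$(2\sqrt{\pi t})^d \cdot E[\exp(t\|X\|^2)] = E\Big[\prod_{i \in [d]} \int_{-\infty}^{\infty} \exp\Big(u[i]X[i] - \frac{u[i]^2}{4t}\Big)du[i]\Big]$$

$$= E\Big[\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \prod_{i \in [d]} \exp\Big(u[i]X[i] - \frac{u[i]^2}{4t}\Big) \prod_{i \in [d]} du[i]\Big]$$

$$= E\Big[\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\Big(\langle u, X \rangle - \frac{\|u\|^2}{4t}\Big) \prod_{i \in [d]} du[i]\Big]$$

$$= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} E[\exp(\langle u, X \rangle)] \exp\Big(-\frac{\|u\|^2}{4t}\Big) \prod_{i \in [d]} du[i].$$

By the sub-Gaussianity of $X$, the last expression is bounded by

$$\le \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\Big(\frac{1}{2}\langle Bu, u \rangle - \frac{\|u\|^2}{4t}\Big) \prod_{i \in [d]} du[i]$$

$$= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \prod_{i \in [d]} \exp\Big(\frac{\lambda_i u[i]^2}{2} - \frac{u[i]^2}{4t}\Big) \prod_{i \in [d]} du[i]$$

$$= \prod_{i \in [d]} \int_{-\infty}^{\infty} \exp\Big(-u[i]^2\Big(\frac{1}{4t} - \frac{\lambda_i}{2}\Big)\Big)du[i] = \pi^{d/2} \prod_{i \in [d]} \Big(\frac{1}{4t} - \frac{\lambda_i}{2}\Big)^{-\frac{1}{2}}.$$

The last equality follows from the fact that for any $a > 0$, $\int_{-\infty}^{\infty} \exp(-a \cdot s^2)ds = \sqrt{\pi/a}$, and from
the assumption $t \le \frac{1}{4\lambda_1}$. We conclude that

$$E[\exp(t\|X\|^2)] \le \Big(\prod_{i \in [d]} (1 - 2\lambda_i t)\Big)^{-\frac{1}{2}} \le \exp\Big(2t \cdot \sum_{i=1}^d \lambda_i\Big) = \exp(2t \cdot \mathrm{tr}(B)),$$

where the second inequality holds since $\forall x \in [0, 1]$, $(1 - x/2)^{-1} \le \exp(x)$. □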
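
Lemma A.9 can be sanity-checked on a case where everything is explicit. The sketch below assumes $X$ is a centered Gaussian vector with covariance $B = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$, for which the hypothesis holds with equality and $E[\exp(t\|X\|^2)] = \prod_i (1 - 2\lambda_i t)^{-1/2}$; the Gaussian choice and the function names are illustrative assumptions, not part of the proof.

```python
import math


def gaussian_sq_norm_mgf(lams, t):
    """E[exp(t * ||X||^2)] for X ~ N(0, diag(lams)); needs 2 * t * max(lams) < 1."""
    return math.prod((1.0 - 2.0 * lam * t) ** -0.5 for lam in lams)


def lemma_a9_bound(lams, t):
    """Right-hand side of Lemma A.9: exp(2 t * trace(B))."""
    return math.exp(2.0 * t * sum(lams))


lams = [2.0, 1.0, 0.5, 0.1]
t_max = 1.0 / (4.0 * max(lams))  # largest t allowed by the lemma
for t in (0.01, 0.05, 0.1, t_max):
    print(t, gaussian_sq_norm_mgf(lams, t), lemma_a9_bound(lams, t))
```

The comparison reduces to the scalar inequality $(1 - x/2)^{-1} \le \exp(x)$ on $[0, 1]$ with $x = 4\lambda_i t$, which is the last step of the proof.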